Apache Mahout

Introduction to Apache Mahout: It started as a sub project of Lucene. (More to do with Hadoop.) Sean Owen started Taste Project ( Recommender Systems ) in 2005.

What is covered in this guide?

Setting up Apache Mahout 0.9 with Apache Hadoop 1.2.1
Basics: Similarity / Distance / Dissimilarity Metrics
Recommendation Algorithms
Clustering Algorithms
Classification Algorithms

Setting up

Requirements for Setting Up:

Java
IDE - Eclipse
Maven
Mahout distribution ( preferably the latest version )
Hadoop

Setting up Hadoop + Mahout

We need to setup Apache Hadoop 1.2.1 and Apache Mahout

Setup Hadoop: descripbed here
Setup Mahout ( 0.9 )

Apache Mahout setup

$ wget -c http://archive.apache.org/dist/mahout/0.9/mahout-distribution-0.9.tar.gz
$ tar zxf mahout-distribution-0.9.tar.gz
$ cd mahout-distribution-0.9

Setup env variables:

export HADOOP_HOME=/path/to/hadoop-1.2.1
export PATH=$HADOOP_HOME/bin:$PATH
export MAHOUT_HOME=/path/to/mahout-distribution-0.9
export PATH=$MAHOUT_HOME/bin:$PATH

First example

Canopy Clustering example

$ wget http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data
$ hadoop fs -mkdir /canopy-testdata
$ hadoop fs -copyFromLocal /home/tuxdna/work/learn/external/synthetic_control.data /canopy-testdata/
$ mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job -i hdfs://localhost:9000/canopy-testdata/ -o /canopy-output -t1 80 -t2 55

Dump the clusters

$ mahout clusterdump -i hdfs://localhost:9000/canopy-output/clusters-*-final  -o /home/tuxdna/clusteranalyze.txt

we can also download outut sequence files

$ hadoop fs -get hdfs://localhost:9000/canopy-output ~/Downloads/
$ mahout clusterdump -i /home/tuxdna/Downloads/output/clusters-0-final/ --pointsDir /home/tuxdna/Downloads/output/clusteredPoints/ -o /home/tuxdna/Downloads/output/clusteranalyze.txt

Got this error ? Fixed by this

Unexpected --seqFileDir while processing Job-Specific Options:

Fixed command

$ mahout clusterdump -i output/clusters-0-final/ --pointsDir output/clusteredPoints/ --output /tmp/clusteranalyze.txt
Running on hadoop, using /home/tuxdna/work/learn/external/hadoop-1.1.1/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /home/tuxdna/work/learn/external/mahout-distribution-0.7/mahout-examples-0.7-job.jar
13/05/22 18:45:10 INFO common.AbstractJob: Command line arguments: {--dictionaryType=[text], --distanceMeasure=[org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure], --endPhase=[2147483647], --input=[/home/tuxdna/Downloads/output/clusters-0-final/], --output=[/home/tuxdna/Downloads/output/clusteranalyze.txt], --outputFormat=[TEXT], --pointsDir=[/home/tuxdna/Downloads/output/clusteredPoints/], --startPhase=[0], --tempDir=[temp]}
13/05/22 18:45:10 INFO clustering.ClusterDumper: Wrote 0 clusters
13/05/22 18:45:10 INFO driver.MahoutDriver: Program took 629 ms (Minutes: 0.010483333333333334)

KMeans clustering on Text corpus:

$ mahout seqdirectory -i file:/full_txt/ -o file:/out-seqdir -c UTF-8 -chunk 5
$ mahout seq2sparse -i file:/out-seqdir -o /out-seqdir-sparse --maxDFPercent 8 --namedVector
$ mahout kmeans -ow -i /out-seqdir-sparse/tfidf-vectors/ -c /out-kmeans-clusters  -o /out-kmeans -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 20 -ow --clustering -cl
$ mahout clusterdump -i /out-kmeans/clusters-*-final -d /out-seqdir-sparse/dictionary.file-0 -dt sequencefile -b 100 -n 20 --evaluate -dm org.apache.mahout.common.distance.CosineDistanceMeasure --pointsDir /out-kmeans/clusteredPoints -o /out-output.txt

Similarity metrics

We will discuss following

Pearson correlation
Euclidean distance
Cosine measure
Spearman correlation
Tanimoto coefficient
Log likelihood test

Pearson correlation

Pearson Correlation definition: The value is in the range -1.0 to 1.0

Pearson correlation of two series is the ratio of their covariance to the product of their variances.

Perason correlation problems:

it doesn’t take into account the number of items in which two users preferences overlap
if two users overlap on only one item, no correlation can be computed because of how the computation is defined
the correlation is also undefined if either series of preference values are all identical

There is also a weighted pearson correlation metric, which moves correlation towards -1 or +1, when the preferences for some feathres are missing.

Euclidean distance

Euclidean distance is geometric distance between two n-dimensional points. This is a distance metric, so to convert it into a similarity metric, it is defined as

s = 1 / (1 + d )
where s = similarity
and d = euclidean distance

Cosine measure

Cosine measure is the cos(theta) of angle theta between two points with respect to origin. This is essentially implemented as Pearson Correlation metric in Mahout, mentioned above.

Spearman correlation

Spearman correlation - defining similarity by relative rank

Tanimoto coefficient

Also know as Jackard coefficent is:

tc = intersection of preferences / union of preferences

Log likelihood test

It is quite similar to Tanimoto Coefficient, which measures an overlap. However, Log Likelihood test, is an expression of how unlikely are will users have so much overlap, given the total number of items and the number of items each user has a preference for.

Two similar users are likely to rate a movie common to them. However two dissimilar users are unlikely to rate a common movie. Therefore the more unilkely, the more similar two users shoud be. The resulting value can be interpreted as a probability that an overlap isn’t just due to chance.

Distance measures

Distance / dissimilarity measures

Euclidean Distance: Defined above.
Squared Euclidean Distance: d = (a1 - b1)^2 + (a2 - b2)^2 + ... + (an - bn) ^ 2
Manhattan Distance: d = (a1 - b1) + (a2 - b2) + ... + (an - bn)
Cosine Distance: Defined above
Tanimoto Distance: Defined above
Weighted Distance Measure: Assigns weights for different features in Euclidean or Manhattan Distances. Define a weight Vector, which has weight factor values.

Classes of Clustering Algorithms

generative algorithm: fit the model to the data. using the model, that data can be generated which fits the model. example: LDA

discriminative algorithm: fit the data to the model; such as split the data into k sets based on some distance metric. example: hierarchical, k-means, SVM etc.

Recommendation Algorithms

Input data takes a form of preferences. Preference is a tuple of three values: userId, itemId, preferenceValue. Mahout has its own data structures to store these values efficiently. It has FastMap, FastByIdMap, FastIDSet etc. The preferences are stored in memory using GenericPreference, GenericPreferenceArray etc.

These values can be loaded from files or from databases or other storage media. For file backed storage, Mahout has FileDataModel class. It is simple to use.

For recommendation algorithms, first we create a DataModel for user, item and preferences. Then we select a similarity measure i.e. UserSimilarity or ItemSimilarity. Next, we select a class that creates a neighbourhood of users or items based on a certain criteria. This criteria can be a threshold value of similarity or a limit on top N items.

So we have a DataModel, a UserSimilarity ( or ItemSimilarity ) and a UserNeighborhood. Finally we create Recommender. Mahout has many different kinds of recommenders already built into it e.g.: GenericUserBasedRecommender, SlopeOneRecommender etc.

Order of business here is:

Choose a data model and load preference data
Choose a similiarity metric
Define a neighborhood
Choose an algorithm implementation for recommendations
Now we are ready for recommendations.

Another point to note is that we would need to evaluate the quality of recommendations that Mahout is giving us. For this there is a recommender evaluation framework already in place. Two metrics are:

RecommenderEvaluator: measures the divergence of estimated recommendation value from the actual recommendation value
RecommenderIRStatsEvaluator: measures precision and recall of the recommendation

Apache Mahout has all of this already implemented.

Lets look at MySQL based data model.

MySQL Data Model

Taste preferences table schema is as follows

CREATE TABLE taste_preferences (
  user_id BIGINT NOT NULL,
      item_id BIGINT NOT NULL,
      preference FLOAT NOT NULL,
      PRIMARY KEY (user_id, item_id),
      INDEX (user_id),
      INDEX (item_id));

Setup a Database and table

mysql> create database mia01;
Query OK, 1 row affected (0.00 sec)

mysql> use mia01;
Database changed

mysql> CREATE TABLE taste_preferences ( user_id BIGINT NOT NULL, item_id BIGINT NOT NULL, preference FLOAT NOT NULL, PRIMARY KEY (user_id, item_id), INDEX (user_id), INDEX (item_id));
Query OK, 0 rows affected (0.11 sec)

Create item similarity table:

CREATE TABLE taste_item_similarity (
  item_id_a BIGINT NOT NULL,
  item_id_b BIGINT NOT NULL,
  similarity FLOAT NOT NULL,
  PRIMARY KEY (item_id_a, item_id_b));

Components and their compatibility

Boolean Preference Data Model

is not compatible with PearsonCorrelation / EuclideanDistance
is compatible with LogLikelihood

In case of zero difference in estimate and actual preference, by the evauluator, we can always do a precision / recall evaluation — using IRStats evaluator.

User based recommender

User based recommenders first finds similar users and then sees what they like.

Algorithm: User-Based Recommender

for every item i that u has no preference for yet
  for every other user v that has a preference for i
    compute a similarity s between u and v
    incorporate v's preference for i, weighted by s, into a running average
return the top items, ranked by weighted average

Basic algorithm with a user neighbourhood:

for every other user w
  compute a similarity s between u and w
  retain the top users, ranked by similarity, as a neighbourhood n

for item i in neighbourhood except the ones rated by u
  for user v in neighbourhood who has a preference for i
    compute a similarity s between u and v
    incorporate v's preference for i, weighted by s, into a running average

return the top items, ranked by weighted average

Kind of neighbourhood metrics:

fixed size
threshold based

It is also possible to infer values for missing preferences in Mahout. This is achieveable using PreferenceInferer implementation such as AveragePreferenceInferer.

Item based recommender

Item based recommenders first sees what the user likes and then finds similar items.

Algorithm: Item-Based Recommender

for every item i that u has no preference for yet
  for every item j that u has a preference for
    compute a similarity s between i and j
    add u's preference for j, weighted by s, to a running average

return the top items, ranked by weighted average

Note: the running time of an item-based recommender scales up as the number of items increases, whereas a user-based recommender’s running time goes up as the number of users increases.

Slope One recommender

Algorithm

Preprocessing

  for every item i
    for every other item j
      for every user u expressing preference for both i and j
        add the difference in u's preference for i and j to an average

SlopeOne Algorithm(u: User)

  for every item i for which u expresses no preference
    for every item j for which u expresses a preference
      find the average preference difference between j and i
      add this diff to u's preference value for j
      add this to a running average
   return the top items, raned by these averages

Implementation in a pseudo-code (Following Code is Scala like)

I = { set of all items }
U = { set of all users }
M = { (i,u,p): i in I, u in U, p is a preference value }

AvgDiff = ZeroTriangularMatrix(I.size)

for(i <- I) {
  val sumDiff = 0
  val allExcept_i = I.filter(x => x != i)
  val count = 0
  for( j <- allExcept_i)) {
    for( (u, pi, pj) <- M.findUsersFor(i,j)) {
      sumDiff += pi - pj
      count += 1
    }
  }
  AvgDiff(i)(j) = sumDiff / count
}

def slopeOne(u: User, n: Int) = {
  val notRatedItems = I.filter( x => ! u.preferredItems.contains(x))
  val ratedItems = u.preferredItems();
  val preferenceList = listForAllItems(I)
  for( i <- notRatedItems ) {
    val sumForI = 0
    val count = 0
    for( j <- ratedItems ) {
      val avgDiff = AvgDiff(i)(j)
      sumForI += u.preferenceFor(j) + avgDiff
      count += 1
    }
    val avgForI = sumForI / count
    preferenceList(i) = avgForI
  }
  preferenceList.sortBy(average).reverse().take(n)
}

SlopeOne Gotchas

The difference does not take into account the number of users who provided the ratings. Even if the rating is given for two items only by one user, its weightage will be same as rating given by many more users. This can be mitigated by using weigthing schemes:

Count based weighting - more users means more weightage ( rating is more reliable )
Standard deviation based weighting - less standard deviation means more reliable ratings

Diff calculation is a resource intesive process. It uses a lot of memory as well as CPU. This can be done over offline storage. The compution can also be distributed using Hadoop.

SVD Recommender

Uses Matrix factorization to reduce the total data points, also summarizing them.

Issues: Slow to compute. All SVD is done in memory.

Produces good and meaningful results.

Linear interpolation based Recommender

Implementd as KNN ( K Nearest Neighbours ).

Sample code:

m = new DataModel();
ItemSimilarity s = new LogLikelihoodSimilarity(m);
Optimizer optimizer = new NonNegativeQuadraticOptimizer();
r = new KnnItemBasedRecommender(m,s,optimizer,10);

Cluseter-based recommendation

A variant on user-based CF, where items are recommendd to clusters of similar users. First all users are paritioned into clusters. Now the recommendations are computed for group of users belonging to the cluster.

Clustering

The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering.

A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters.

Process of clustering involves

an algorithm
a simimarity/dissimarity metric
and a stop condition

TF-IDF measure

Given

N = Number of Documents
TFi = Term Frequency of term i
DFi = Document Frequency of term i ( number of documents in which i occurs )
IDFi = Inverted Document Frequency of term i = 1 / DFi

Therefore:

Wi = Weight of term i
Wi = TF * N / IDF

In the above equation IDF can diminish the effect of TF to a large degree. Therefore, we can use log of IDF

Wi = TF * log(N / IDF)

Finally a good TF-IDF measure is given by the above equation.

Collocation: A sequence of related words occuring together, such as New York City. Although these are three words New, York and City, when combined they have a different meaning. Therefore they can be, together considered as a single term. This is also known as Word N-Grams.

Creating Vectors from Text documents

Create sequence files

$ mahout seqdirectory -i /data/lda/text-files/ -o /data/lda/output-seqdir -c UTF-8
Running on hadoop, using ....hadoop-1.1.1/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: ....mahout-distribution-0.7/mahout-examples-0.7-job.jar
14/03/24 20:57:20 INFO driver.MahoutDriver: Program took 594764 ms (Minutes: 9.912733333333334)

Convert sequence files to sparse vectors. Use TFIDF by default.

$ mahout seq2sparse -i /data/lda/output-seqdir -o /data/lda/output-seq2sparse/ -ow
Running on hadoop, using ....hadoop-1.1.1/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: ....mahout-distribution-0.7/mahout-examples-0.7-job.jar
14/03/24 21:00:08 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
14/03/24 21:00:09 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
14/03/24 21:00:09 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
14/03/24 21:00:10 INFO input.FileInputFormat: Total input paths to process : 1
14/03/24 21:00:11 INFO mapred.JobClient: Running job: job_201403241418_0001
.....
14/03/24 21:02:51 INFO driver.MahoutDriver: Program took 162906 ms (Minutes: 2.7151)

This step creates:

TF vectors in /data/lda/output-seq2sparse/tf-vectors
DF count in /data/lda/output-seq2sparse/df-count
TF-IDF vectors in /data/lda/output-seq2sparse/tfidf-vectors

Normalization

p-norm: Read on Wikipedia
Manhattan norm: When p = 1
Euclidean norm: When p = 2
Infinite norm: Simply divide by weight of largest magnitude dimension.

Following command fails ( using /data/lda/output-seq2sparse as input )

$ mahout seq2sparse -i /data/lda/output-seq2sparse -o /data/lda/output-seq2sparse-normalized -ow -a org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 200 -wt tfidf -s 5 -md 3 -x 90 -ng 2  -ml 50 -seq -n 2 -nr 5
Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://localhost:54310/data/lda/output-seq2sparse/df-count/data
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:528)
    at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
    ....SKIPPED....
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

However this works just fine ( using /data/lda/output-seqdir as input )

$ mahout seq2sparse -i /data/lda/output-seqdir -o /data/lda/output-seq2sparse-normalized -ow -a org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 200 -wt tfidf -s 5 -md 3 -x 90 -ng 2  -ml 50 -seq -n 2 -nr 5
Running on hadoop, using .../hadoop-1.1.1/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: ..../mahout-distribution-0.7/mahout-examples-0.7-job.jar
14/03/24 21:35:55 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 2
14/03/24 21:35:56 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 50.0
14/03/24 21:35:56 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 5
14/03/24 21:35:57 INFO input.FileInputFormat: Total input paths to process : 1
...SKIPPED...
14/03/24 21:45:11 INFO common.HadoopUtil: Deleting /data/lda/output-seq2sparse-normalized/partial-vectors-0
14/03/24 21:45:11 INFO driver.MahoutDriver: Program took 556420 ms (Minutes: 9.273666666666667)

KMeans clustering

Running K-Means algorithm on the vectors we generated using SquaredEuclideanDistanceMeasure

$ mahout kmeans -i /data/lda/output-seq2sparse-normalized/tfidf-vectors -c /data/lda/output-kmeans-initialclusters -o /data/lda/output-kmeans-clusters -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cd 1.0 -k 20 -x 20
14/03/25 15:05:46 INFO common.AbstractJob: Command line arguments: {--clusters=[/data/lda/output-kmeans-initialclusters], --convergenceDelta=[1.0], --distanceMeasure=[org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure], --endPhase=[2147483647], --input=[/data/lda/output-seq2sparse-normalized/tfidf-vectors], --maxIter=[20], --method=[mapreduce], --numClusters=[20], --output=[/data/lda/output-kmeans-clusters], --startPhase=[0], --tempDir=[temp]}
14/03/25 15:05:46 INFO common.HadoopUtil: Deleting /data/lda/output-kmeans-initialclusters
14/03/25 15:05:46 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/03/25 15:05:46 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
14/03/25 15:05:46 INFO compress.CodecPool: Got brand-new compressor
14/03/25 15:05:49 INFO kmeans.RandomSeedGenerator: Wrote 20 Klusters to /data/lda/output-kmeans-initialclusters/part-randomSeed
14/03/25 15:05:49 INFO kmeans.KMeansDriver: Input: /data/lda/output-seq2sparse-normalized/tfidf-vectors Clusters In: /data/lda/output-kmeans-initialclusters/part-randomSeed Out: /data/lda/output-kmeans-clusters Distance: org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
14/03/25 15:05:49 INFO kmeans.KMeansDriver: convergence: 1.0 max Iterations: 20 num Reduce Tasks: org.apache.mahout.math.VectorWritable Input Vectors: {}
14/03/25 15:05:49 INFO compress.CodecPool: Got brand-new decompressor
Cluster Iterator running iteration 1 over priorPath: /data/lda/output-kmeans-clusters/clusters-0
14/03/25 15:05:50 INFO input.FileInputFormat: Total input paths to process : 5
14/03/25 15:05:50 INFO mapred.JobClient: Running job: job_201403241418_0020
14/03/25 15:05:51 INFO mapred.JobClient:  map 0% reduce 0%
...
14/03/25 15:06:31 INFO mapred.JobClient:  map 100% reduce 100%
...
14/03/25 15:06:31 INFO driver.MahoutDriver: Program took 45301 ms (Minutes: 0.7550166666666667)

Dump cluster points:

$ mahout clusterdump  -b 10 -n 10  -dt sequencefile -d /data/lda/output-seq2sparse-normalized/dictionary.file-* -i /data/lda/output-kmeans-clusters/clusters-1-final -o ./kmeans-dump
14/03/25 18:09:22 INFO common.AbstractJob: Command line arguments: {--dictionary=[/data/lda/output-seq2sparse-normalized/dictionary.file-*], --dictionaryType=[sequencefile], --distanceMeasure=[org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure], --endPhase=[2147483647], --input=[/data/lda/output-kmeans-clusters/clusters-1-final], --numWords=[10], --output=[./kmeans-dump], --outputFormat=[TEXT], --startPhase=[0], --substring=[10], --tempDir=[temp]}
14/03/25 18:09:27 INFO clustering.ClusterDumper: Wrote 20 clusters
14/03/25 18:09:27 INFO driver.MahoutDriver: Program took 4704 ms (Minutes: 0.0784)

KMeans using CosineDistanceMeasure

$ mahout kmeans -i /data/lda/output-seq2sparse-normalized/tfidf-vectors -c /data/lda/output-kmeans-initialclusters -o /data/lda/output-kmeans-cosine-clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1 -k 20 -x 20
14/03/25 18:21:29 INFO common.AbstractJob: Command line arguments: {--clusters=[/data/lda/output-kmeans-initialclusters], --convergenceDelta=[0.1], --distanceMeasure=[org.apache.mahout.common.distance.CosineDistanceMeasure], --endPhase=[2147483647], --input=[/data/lda/output-seq2sparse-normalized/tfidf-vectors], --maxIter=[20], --method=[mapreduce], --numClusters=[20], --output=[/data/lda/output-kmeans-cosine-clusters], --startPhase=[0], --tempDir=[temp]}
14/03/25 18:21:29 INFO common.HadoopUtil: Deleting /data/lda/output-kmeans-initialclusters
14/03/25 18:21:29 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/03/25 18:21:29 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
14/03/25 18:21:29 INFO compress.CodecPool: Got brand-new compressor
14/03/25 18:21:33 INFO kmeans.RandomSeedGenerator: Wrote 20 Klusters to /data/lda/output-kmeans-initialclusters/part-randomSeed
14/03/25 18:21:33 INFO kmeans.KMeansDriver: Input: /data/lda/output-seq2sparse-normalized/tfidf-vectors Clusters In: /data/lda/output-kmeans-initialclusters/part-randomSeed Out: /data/lda/output-kmeans-cosine-clusters Distance: org.apache.mahout.common.distance.CosineDistanceMeasure
14/03/25 18:21:33 INFO kmeans.KMeansDriver: convergence: 0.1 max Iterations: 20 num Reduce Tasks: org.apache.mahout.math.VectorWritable Input Vectors: {}
14/03/25 18:21:33 INFO compress.CodecPool: Got brand-new decompressor
Cluster Iterator running iteration 1 over priorPath: /data/lda/output-kmeans-cosine-clusters/clusters-0
14/03/25 18:21:33 INFO input.FileInputFormat: Total input paths to process : 5
14/03/25 18:21:34 INFO mapred.JobClient: Running job: job_201403241418_0022
14/03/25 18:21:35 INFO mapred.JobClient:  map 0% reduce 0%
....SKIPPED....
14/03/25 18:24:28 INFO driver.MahoutDriver: Program took 178941 ms (Minutes: 2.98235)

Dump cluster points:

$ mahout clusterdump  -b 10 -n 10  -dt sequencefile -d /data/lda/output-seq2sparse-normalized/dictionary.file-* -i /data/lda/output-kmeans-cosine-clusters/clusters-4-final -o ./kmeans-cosine-dump
14/03/25 18:26:06 INFO common.AbstractJob: Command line arguments: {--dictionary=[/data/lda/output-seq2sparse-normalized/dictionary.file-*], --dictionaryType=[sequencefile], --distanceMeasure=[org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure], --endPhase=[2147483647], --input=[/data/lda/output-kmeans-cosine-clusters/clusters-4-final], --numWords=[10], --output=[./kmeans-cosine-dump], --outputFormat=[TEXT], --startPhase=[0], --substring=[10], --tempDir=[temp]}
14/03/25 18:26:10 INFO clustering.ClusterDumper: Wrote 20 clusters
14/03/25 18:26:10 INFO driver.MahoutDriver: Program took 3630 ms (Minutes: 0.0605)

Fuzzy KMeans

$ mahout fkmeans -i /data/lda/output-seq2sparse-normalized/tfidf-vectors -c /data/lda/output-fkmeans-squared-initialclusters -o /data/lda/output-fkmeans-squared-clusters -cd 1.0 -k 20 -m 2 -ow -x 20 -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure

Problems with KMeans

overlapping ( can be handled with Fuzzy KMeans )
non-circular distribution
not hierarchical

Dirichlet clustering

DisplayDirichlet clustering

$ mahout org.apache.mahout.clustering.display.DisplayDirichlet
14/03/28 14:58:30 WARN driver.MahoutDriver: No org.apache.mahout.clustering.display.DisplayDirichlet.props found on classpath, will use command-line arguments only
14/03/28 14:58:36 INFO display.DisplayClustering: Generating 500 samples m=[1.0, 1.0] sd=3.0
14/03/28 14:58:36 INFO display.DisplayClustering: Generating 300 samples m=[1.0, 0.0] sd=0.5
14/03/28 14:58:36 INFO display.DisplayClustering: Generating 300 samples m=[0.0, 2.0] sd=0.1
14/03/28 14:58:53 INFO driver.MahoutDriver: Program took 23027 ms (Minutes: 0.3837833333333333)

Generate matrix:

mahout-distribution-0.9$ bin/mahout rowid -i /data/clustering/reuters-out-sparse/tf-vectors -o /data/clustering/reuters-out-rowid/
Running on hadoop, using ...hadoop-1.1.1/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: ...mahout-distribution-0.9/mahout-examples-0.9-job.jar
14/03/31 12:49:36 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[/data/clustering/reuters-out-sparse/tf-vectors], --output=[/data/clustering/reuters-out-rowid/], --startPhase=[0], --tempDir=[temp]}
14/03/31 12:49:37 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/03/31 12:49:37 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
14/03/31 12:49:37 INFO compress.CodecPool: Got brand-new compressor
14/03/31 12:49:37 INFO compress.CodecPool: Got brand-new compressor
14/03/31 12:49:39 INFO vectors.RowIdJob: Wrote out matrix with 21578 rows and 57545 columns to /data/clustering/reuters-out-rowid/matrix
14/03/31 12:49:39 INFO driver.MahoutDriver: Program took 3257 ms (Minutes: 0.054283333333333336)

Mahout 0.7

dirichlet
kmeans / fkmeans: throwing errors with heap space

Mahout 0.9

dirichlet doesn’t exist
kmeans / fkmeans work better without throwing any errors

clusterpp

Set env variables

DISTMETRIC=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
TFIDF_VEC=/data/clustering/reuters-out-sparse/tfidf-vectors
INITCLUSTERS=/data/clustering/reuters-out-kmeans-initialclusters
CLUSTERS=/data/clustering/reuters-out-kmeans-clusters

Run kmeans

$ hadoop fs -rmr /data/clustering/reuters-out-kmeans-clusters
$ mahout kmeans -cl -cd 1.0 -k 20 -x 20 -dm $DISTMETRIC -i $TFIDF_VEC -c $INITCLUSTERS -o $CLUSTERS
Running on hadoop, using ...hadoop-1.1.1/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: ...mahout-distribution-0.9/mahout-examples-0.9-job.jar
14/03/31 15:08:33 INFO common.AbstractJob: Command line arguments: {--clusters=[/data/clustering/reuters-out-kmeans-initialclusters], --convergenceDelta=[1.0], --distanceMeasure=[org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure], --endPhase=[2147483647], --input=[/data/clustering/reuters-out-sparse/tfidf-vectors], --maxIter=[20], --method=[mapreduce], --numClusters=[20], --output=[/data/clustering/reuters-out-kmeans-clusters], --startPhase=[0], --tempDir=[temp]}
14/03/31 15:08:33 INFO common.HadoopUtil: Deleting /data/clustering/reuters-out-kmeans-initialclusters
14/03/31 15:08:33 INFO util.NativeCodeLoader: Loaded the native-hadoop library
.... OUTPUT SKIPPED ...
14/03/31 15:10:04 INFO driver.MahoutDriver: Program took 91562 ms (Minutes: 1.5260333333333334)

Run clusterpp

$ mahout clusterpp -i /data/clustering/reuters-out-kmeans-clusters -o /data/clustering/reuters-out-clusterpp -xm mapreduce -ow

Classification

The data available can be modeled in terms of records, fields and target variables:

Record(field1, field2, ... fieldN) -> targetVariable

OR

(features or predictor variables) -> targetVariable

Learning process

List(features, target) ==> learning algorithm ==> model

Classification process

List(features) --> model ==> target

Kinds of Predictor Variables

Continuous ( infinite )
Categorical ( finite discreet )
Word-like ( discreet infinite single words)
Text-like

Running and understanding SGD

$ mahout cat donut.csv
$ mahout trainlogistic --input donut.csv --output ./model --target color --categories 2 --predictors x y --types numeric --features 20 --passes 100 --rate 50
20
color ~ -0.149*Intercept Term + -0.701*x + -0.427*y
      Intercept Term -0.14885
                   x -0.70136
                   y -0.42740
0 0 0 0 0 0 0 0 0 0 -0.701362221 0 0 -0.148846792 0 0 0 -0.427403872 0 0
$ mahout runlogistic --input donut.csv --model ./model --auc --confusion
AUC = 0.57
confusion: [  [ 27.0,  13.0],
              [  0.0,   0.0]  ]
entropy:   [  [ -0.4, -0.3 ],
              [ -1.2, -0.7 ]  ]

AUC - area under curve

confusion - confusion matrix

entropy

Add more predictors and passes

$ mahout trainlogistic --input donut.csv --output model --target color --categories 2 --predictors x y a b c --types numeric --features 20 --passes 100 --rate 50
20
color ~ 7.068*Intercept Term + 0.581*a + -1.369*b + -25.059*c + 0.581*x + 2.319*y
      Intercept Term 7.06759
                   a 0.58123
                   b -1.36893
                   c -25.05945
                   x 0.58123
                   y 2.31879
0 0 0 0 0 -1.368933989 0 0 0 0 0.581234210 0 0 7.067587159 0 0 0 2.318786209 0 -25.059452292

Results improve this time

$ mahout runlogistic --input donut.csv --model ./model --auc --confusion
AUC = 1.00
confusion: [[27.0, 1.0], [0.0, 12.0]]
entropy: [[-0.1, -1.5], [-4.0, -0.2]]

Running on other data

$ mahout runlogistic --input donut-test.csv --model ./model --auc --confusion
AUC = 0.97
confusion: [[24.0, 2.0], [3.0, 11.0]]
entropy: [[-0.2, -2.8], [-4.1, -0.1]]

Different Classification Algorithms in Mahout

SGD / Logistic Regression Algorithms:

OnlineLogisticRegression
AdaptiveLogisticRegression
CrossFoldLearner

Naive-Bayes Algorithm

Complimentary Naive-Bayes

Hidden Markov Model

Random Forest

Others:

What happened to SVM implementations in Mahout ?

Text Mining

Discussion on Calculate Term-Cooccurence

Apache Mahout Collocations

High Frequency Terms/Phrases at the Index level JIRA Ticket LUCENE-474

What is the easiest way to implement terms association mining in Solr?

Topic Modeling

References:

* Machine Learning and Data Mining in Ruby and R * Machine Learning with Apache Mahout * Introduction to Apache Mahout * Apache Taste * Apache Mahout Recommender Tutorial * First Steps with Apache Mahout - Classification * Apache Mahout and MySQL * Correlation explained * Third Generation Tools for Realizing Machine Learning Algorithms * Selecting an appropriate similarity metric & assessing the validity of a k-means clustering model * An Introduction To Mahout’s Logistic Regression SGD Classifier * Mahout Scala * Mahout SGD Example * Converting Similariy to Distance and vice-versa * Probability Distributions in R * Word Space Model by Magnus Sahlgren