Saturday, March 21, 2009

Map reduce and k means

Map reduce have been a lot of help in dealing with huge data sets , and I have worked a lot in adapting machine learning algorithms for hadoop. One of the simplest examples of this would be implementation of a k means clustering algorithm for hadoop.

K means clustering algorithm is one of the most widely known algorithm for clustering.Clustering is the assignment of objects into groups (called clusters) so that objects from the same cluster are more similar to each other than objects from different clusters. The defining idea for similarity is a distance measure which might be simple vector distance when objects are vectors or various string similarity metrics e.g Jaccard Similarity when the objects are search queries. (Read more abt string similarity metrics at simmetrics )

K means clustering algorithm can be described as :

Initialize the algorithm randomly (i.e randomly assign K initial centroids) and then iterate through the following steps until convergence (or a stopping criteria met):

  1. For each data point x, compute its membership in clusters by choosing the nearest centroid
  2. For each centroid, recompute its location according to members

This algorithm can be adapted very easily to map reduce. The objects are stored with key as the clusterid and value equal to the actual object


  1. In the initialization state , Compute K random centroids and store them as an array

  2. Each iteration repeat the following map-reduce cycleMap : Read the array of centroids stored from the previous iterations. The input is <object-value> to the map phase. For each such pair , compute the closest centroid to the object and emit <clusterid of closest centroid, object-value> as the key,value pair

    1. Map : Read the array of centroids stored from the previous iterations. The input is <object-value> to the map phase. For each such pair , compute the closest centroid to the object and emit <clusterid of closest centroid, object-value> as the key,value pair
    2. Reduce: All the objects corresponding to the same cluster (key) are sent to the same reduce phase, the next step is to recompute the new centroids which can be done by computing the mean of the objects corresponding to the same key = clusterid . Once all the objects corresponding to one cluster are read, the new median is computed and stored somewhere to be read by the next iteration.

I will describe how different algorithms (both machine learning and otherwise) might be adopted for map reduce in my next blogs. Seems to be a good idea to blog about.

Thursday, February 5, 2009

An interesting puzzle

An evil king has 1000 bottles of wine. A neighboring queen plots to kill the bad king, and sends a servant to poison the wine. The king's guards catch the servant after he has only poisoned one bottle. The guards don't know which bottle was poisoned, but they do know that the poison is so potent that even if it was diluted 1,000,000 times, it would still be fatal. Furthermore, the effects of the poison take one month to surface. The king decides he will get some of his prisoners in his vast dungeons to drink the wine. Rather than using 1000 prisoners each assigned to a particular bottle, this king knows that he needs to murder no more than 10 prisoners to figure out what bottle is poisoned, and will still be able to drink the rest of the wine in 5 weeks time. How does he pull this off ??



I will post the solution in my next post ..








Tuesday, February 3, 2009

The present Economy



Feel Something familiar happening right now



Sunday, February 1, 2009

Mobile Positioning Techniques

The presentation has a nice overview of location finding methods on a mobile plus the different apis available to a developer to use those methods.