MapReduce has been a great help in dealing with huge data sets, and I have spent a lot of time adapting machine learning algorithms for Hadoop. One of the simplest examples of this is an implementation of the k-means clustering algorithm for Hadoop.
The k-means algorithm is one of the most widely known clustering algorithms. Clustering is the assignment of objects into groups (called clusters) so that objects from the same cluster are more similar to each other than objects from different clusters. The underlying notion of similarity is a distance measure, which might be a simple vector distance when the objects are vectors, or one of various string similarity metrics, e.g. Jaccard similarity, when the objects are search queries. (Read more about string similarity metrics at SimMetrics.)
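For concreteness, here is what the simple vector-distance case looks like in Java (a minimal sketch; the class and method names are my own):

```java
public class Distance {
    // Squared Euclidean distance between two points of equal dimension.
    // For nearest-centroid lookups the square root can be skipped, since
    // it is monotonic and therefore preserves the ordering of distances.
    public static double squaredEuclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }
}
```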
The k-means algorithm can be described as follows:
Initialize the algorithm randomly (i.e. randomly choose K initial centroids), and then iterate through the following two steps until convergence (or until some stopping criterion is met):
- For each data point x, compute its cluster membership by choosing the nearest centroid.
- For each centroid, recompute its location as the mean of its members.
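Putting the two steps together, here is a minimal single-machine sketch of the algorithm, using the squaredEuclidean helper from the earlier snippet (the class and method names, and the choice of random data points as initial centroids, are my own):

```java
import java.util.Arrays;
import java.util.Random;

public class KMeans {
    // One run of the k-means loop on in-memory data.
    // points is an n x d data matrix, k is the number of clusters.
    public static double[][] cluster(double[][] points, int k, int maxIters) {
        Random rnd = new Random();
        int d = points[0].length;
        // Initialization: pick k random data points as the starting centroids.
        double[][] centroids = new double[k][];
        for (int j = 0; j < k; j++) {
            centroids[j] = points[rnd.nextInt(points.length)].clone();
        }
        int[] membership = new int[points.length];
        Arrays.fill(membership, -1);
        for (int iter = 0; iter < maxIters; iter++) {
            // Step 1: assign each point to its nearest centroid.
            boolean changed = false;
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int j = 0; j < k; j++) {
                    double dist = Distance.squaredEuclidean(points[i], centroids[j]);
                    if (dist < bestDist) { bestDist = dist; best = j; }
                }
                if (membership[i] != best) { membership[i] = best; changed = true; }
            }
            if (!changed) break; // converged: no point changed clusters
            // Step 2: move each centroid to the mean of its members.
            double[][] sums = new double[k][d];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                counts[membership[i]]++;
                for (int c = 0; c < d; c++) sums[membership[i]][c] += points[i][c];
            }
            for (int j = 0; j < k; j++) {
                if (counts[j] > 0) {
                    for (int c = 0; c < d; c++) centroids[j][c] = sums[j][c] / counts[j];
                }
            }
        }
        return centroids;
    }
}
```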
This algorithm adapts very easily to MapReduce. The objects are represented as key-value pairs, with the clusterid as the key and the actual object as the value.
- In the initialization step, compute K random centroids and store them as an array that each iteration can read.
- In each iteration, repeat the following map-reduce cycle (sketched in code after this list):
- Map: Read the array of centroids stored by the previous iteration. The input to the map phase is the <object-value>. For each such pair, compute the centroid closest to the object and emit <clusterid of closest centroid, object-value> as the key-value pair.
- Reduce: All the objects corresponding to the same cluster (key) are sent to the same reduce task. The next step is to recompute the new centroids, which can be done by computing the mean of the objects sharing the key = clusterid. Once all the objects belonging to one cluster have been read, the new mean is computed and stored somewhere it can be read by the next iteration.
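Here is a minimal sketch of the map phase against Hadoop's Mapper API. I assume the objects are comma-separated vectors, one per input line, and that the previous iteration's centroids were serialized into the job configuration under a key of my own choosing ("kmeans.centroids"); shipping them via a file in the distributed cache works just as well.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class KMeansMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

    private double[][] centroids;

    @Override
    protected void setup(Context context) {
        // Load the centroids stored by the previous iteration. Here I assume
        // one comma-separated centroid per newline under a configuration key
        // of my own choosing; reading a shared file works just as well.
        String[] lines = context.getConfiguration().get("kmeans.centroids").split("\n");
        centroids = new double[lines.length][];
        for (int j = 0; j < lines.length; j++) {
            String[] parts = lines[j].split(",");
            centroids[j] = new double[parts.length];
            for (int i = 0; i < parts.length; i++) {
                centroids[j][i] = Double.parseDouble(parts[i]);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Parse one object (a comma-separated vector) from the input line.
        String[] parts = value.toString().split(",");
        double[] point = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            point[i] = Double.parseDouble(parts[i]);
        }
        // Find the centroid closest to the object ...
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int j = 0; j < centroids.length; j++) {
            double dist = Distance.squaredEuclidean(point, centroids[j]);
            if (dist < bestDist) { bestDist = dist; best = j; }
        }
        // ... and emit <clusterid of closest centroid, object-value>.
        context.write(new IntWritable(best), value);
    }
}
```

And a matching sketch of the reduce phase, which averages the members of each cluster and emits the new centroid (how the emitted centroids are stored for the next iteration is left out here):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class KMeansReducer extends Reducer<IntWritable, Text, IntWritable, Text> {

    @Override
    protected void reduce(IntWritable clusterId, Iterable<Text> members, Context context)
            throws IOException, InterruptedException {
        double[] sum = null;
        long count = 0;
        // Accumulate the component-wise sum of all objects in this cluster.
        for (Text member : members) {
            String[] parts = member.toString().split(",");
            if (sum == null) sum = new double[parts.length];
            for (int i = 0; i < parts.length; i++) {
                sum[i] += Double.parseDouble(parts[i]);
            }
            count++;
        }
        // The new centroid is the mean of the members; emit it so it can be
        // stored and read back by the next iteration.
        StringBuilder centroid = new StringBuilder();
        for (int i = 0; i < sum.length; i++) {
            if (i > 0) centroid.append(',');
            centroid.append(sum[i] / count);
        }
        context.write(clusterId, new Text(centroid.toString()));
    }
}
```

A driver program would run this job repeatedly, feeding each iteration's reducer output back in as the next iteration's centroid array, until the centroids stop moving or a maximum number of iterations is reached.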
I will describe how different algorithms (both machine learning and otherwise) might be adapted for MapReduce in my next posts. Seems like a good topic to blog about.