KMeans

The KMeans class in Spark MLlib implements the k-means clustering algorithm, with support for k-means|| (also known as k-means parallel) initialization.

Roughly, k-means is an unsupervised, iterative algorithm that groups input data into a predefined number of clusters, k. Each cluster has a centroid, which is its center (the mean of its members). In every iteration the algorithm measures the distance between each vector and the centroids, assigns the vector to the cluster with the nearest centroid, and then recomputes the centroids. These steps are repeated until the centroids converge (move less than a given tolerance) or a specified maximum number of iterations is reached.

Note
K-means is typically computed with Lloyd’s algorithm, the standard iterative heuristic for this problem.
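To make the assign-and-update loop concrete, here is a minimal, non-distributed sketch of Lloyd’s algorithm in plain Scala. It is not Spark’s implementation: the object name LloydSketch, the helper names, and the choice to pass the starting centroids in explicitly are all assumptions made for illustration.

```scala
// A minimal, single-machine sketch of Lloyd's algorithm (not Spark's code).
object LloydSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of the same dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the centroid nearest to point p.
  def nearest(p: Point, centroids: Seq[Point]): Int =
    centroids.indices.minBy(i => sqDist(p, centroids(i)))

  // Component-wise mean of a non-empty group of points.
  def mean(points: Seq[Point]): Point =
    points.transpose.map(_.sum / points.size).toVector

  // Repeats the assignment and update steps until no centroid moves more
  // than `tol`, or until `maxIter` iterations have run.
  // `initial` holds the k starting centroids.
  def run(data: Seq[Point], initial: Seq[Point],
          maxIter: Int = 20, tol: Double = 1e-4): Seq[Point] = {
    var centroids = initial
    var iter = 0
    var converged = false
    while (iter < maxIter && !converged) {
      // Assignment step: group points by their nearest centroid.
      val groups = data.groupBy(p => nearest(p, centroids))
      // Update step: move each centroid to the mean of its group
      // (a centroid with an empty group stays where it is).
      val updated = centroids.indices.map { i =>
        groups.get(i).map(mean).getOrElse(centroids(i))
      }
      converged = centroids.zip(updated).forall {
        case (c, u) => math.sqrt(sqDist(c, u)) <= tol
      }
      centroids = updated
      iter += 1
    }
    centroids
  }
}
```

Calling LloydSketch.run with four two-dimensional points and two of them as starting centroids returns the two cluster means. The maxIter and tol defaults in the sketch mirror the KMeans parameters of the same names, but the stopping rule here is only one plausible reading of "convergence tolerance".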

KMeans is an Estimator that produces a KMeansModel.

Tip
Import org.apache.spark.ml.clustering.KMeans to work with the KMeans algorithm.

KMeans uses the following default values:

  • Number of clusters or centroids (k): 2

  • Maximum number of iterations (maxIter): 20

  • Initialization algorithm (initMode): k-means||

  • Number of steps for the k-means|| initialization (initSteps): 5 (newer Spark releases default to 2)

  • Convergence tolerance (tol): 1e-4
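The defaults listed above can be checked directly on a fresh estimator. The following is a small sketch for spark-shell (or any Scala session with Spark ML on the classpath); the value names kmeans and custom are illustrative, and the printed defaults depend on the Spark version in use.

```scala
import org.apache.spark.ml.clustering.KMeans

// A new estimator carries only default parameter values.
val kmeans = new KMeans()

// Getters return the current (here: default) values.
println(s"k         = ${kmeans.getK}")
println(s"maxIter   = ${kmeans.getMaxIter}")
println(s"initMode  = ${kmeans.getInitMode}")
println(s"initSteps = ${kmeans.getInitSteps}") // version-dependent
println(s"tol       = ${kmeans.getTol}")

// explainParams describes every parameter, including its default value.
println(kmeans.explainParams())

// Setters return the estimator itself, so overrides can be chained.
val custom = new KMeans()
  .setK(3)
  .setMaxIter(50)
  .setTol(1e-6)
  .setSeed(42L)
```

Setting a seed, as in the last statement, makes the random initialization repeatable across runs.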

KMeans assumes that the featuresCol column is of type VectorUDT and appends a predictionCol column of type IntegerType.

Internally, the fit method “unwraps” the feature vectors in the featuresCol column of the input DataFrame and creates an RDD[Vector]. It then hands the call over to the MLlib variant of KMeans, org.apache.spark.mllib.clustering.KMeans. The result is copied to a KMeansModel together with a calculated KMeansSummary.
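From the caller’s side, that flow looks roughly like the sketch below. The four toy points, the local master, and the application name are assumptions chosen only to keep the snippet self-contained; in spark-shell the session already exists as spark.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("kmeans-fit-sketch")
  .getOrCreate()
import spark.implicits._

// Toy input: one vector column named "features" (the default featuresCol).
val points = Seq(
  Vectors.dense(0.0, 0.0),
  Vectors.dense(0.0, 1.0),
  Vectors.dense(10.0, 10.0),
  Vectors.dense(10.0, 11.0)
).map(Tuple1.apply).toDF("features")

// fit trains the model; KMeans is the Estimator, KMeansModel the result.
val model = new KMeans().setK(2).setSeed(1L).fit(points)

// The learned centroids and the training summary.
model.clusterCenters.foreach(println)
println(model.summary.clusterSizes.mkString(", "))

// transform appends the default predictionCol, "prediction".
val predictions = model.transform(points)
predictions.printSchema()
predictions.show()
```

The cluster ids in the prediction column are arbitrary labels: which group ends up as 0 and which as 1 depends on the initialization.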

Each item (row) in a data set is described by a numeric vector of attributes called features. When the data set is a collection of text documents, a single feature (one dimension of the vector) represents a word (token), and its value is a metric that expresses the importance of that word or term in the document.

Tip

Enable INFO logging level for org.apache.spark.mllib.clustering.KMeans logger to see what happens inside a KMeans.

Add the line log4j.logger.org.apache.spark.mllib.clustering.KMeans=INFO to conf/log4j.properties.

Refer to Logging.

KMeans Example

You can represent a text corpus (document collection) using the vector space model. In this representation, the dimension of each vector equals the number of distinct words in the corpus. Such vectors naturally contain many zeros, because no single document uses every word. To avoid storing those zeros, we will use sparse vectors, a memory-optimized representation that keeps only the non-zero entries.
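As a small illustration of the sparse representation, the sketch below builds one document vector over a hypothetical 10-word vocabulary in which only three words occur. The indices and counts are made up for the example.

```scala
import org.apache.spark.ml.linalg.Vectors

// Vocabulary size 10; the document contains only the words at
// positions 1, 4 and 7, with counts 2, 1 and 3 respectively.
val sparse = Vectors.sparse(10, Array(1, 4, 7), Array(2.0, 1.0, 3.0))

// Only the three non-zero entries are stored.
println(sparse)
println(s"size = ${sparse.size}, non-zeros = ${sparse.numNonzeros}")

// The dense form spells out all ten values, zeros included.
println(sparse.toDense)
```

Both forms describe the same vector, so either can appear in a features column; the sparse one simply takes less memory when most entries are zero.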

This example shows how to use k-means to group emails into two clusters, ideally spam and non-spam.
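A sketch of how such an example could be wired together follows. It is an assumption about the shape of the example, not a reproduction of it: the four email texts are invented, and the choice of Tokenizer plus HashingTF with 20 features is only one way to turn text into sparse feature vectors.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("kmeans-email-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical toy corpus: two everyday emails, two spam-like ones.
val emails = Seq(
  (0, "Hi Jacek. Lunch tomorrow at noon? Let me know"),
  (1, "Meeting notes attached. Please review before Friday"),
  (2, "SPAM. You won a FREE prize. Click now to claim"),
  (3, "Cheap pills. FREE shipping. Click now. This is SPAM")
).toDF("id", "text")

// Split each text into lowercase tokens.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("tokens")

// Hash tokens into a fixed-size sparse term-frequency vector.
val hashingTF = new HashingTF()
  .setInputCol("tokens")
  .setOutputCol("features")
  .setNumFeatures(20)

// Two clusters; KMeans reads "features" and writes "prediction" by default.
val kmeans = new KMeans().setK(2).setSeed(1L)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, kmeans))
val model = pipeline.fit(emails)

model.transform(emails)
  .select("id", "text", "prediction")
  .show(truncate = false)
```

Because k-means is unsupervised, the output only says which emails fall into the same cluster. Deciding which cluster id means "spam" still requires looking at the members, and on a corpus this small the split may not match intuition.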
