Implement K-Means Clustering Using sklearn.cluster.KMeans in Python

By | November 10, 2022

In Python, we can implement K-Means clustering easily using sklearn.cluster.KMeans. In this tutorial, we will use some examples to show you how to do it.

Syntax

sklearn.cluster.KMeans is defined as:

class sklearn.cluster.KMeans(n_clusters=8, *, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True, algorithm='lloyd')

This class allows us to implement k-means clustering easily.

Here are some important parameters.

n_clusters: int, default=8. The number of clusters to form. For example, n_clusters=10 means the data will be clustered into 10 classes.

max_iter: int, default=300. The maximum number of iterations of the k-means algorithm for a single run.

random_state: determines random number generation for centroid initialization. Pass an int (for example, random_state=0) to make the result reproducible.

algorithm: {"lloyd", "elkan", "auto", "full"}, default="lloyd". Note that "auto" and "full" are deprecated and will be removed in scikit-learn 1.3.

How to implement K-Means clustering with sklearn.cluster.KMeans?

To implement k-means with the sklearn.cluster.KMeans class, we need to determine two important inputs:

  • inputs – the data you plan to cluster. It should be a numpy array of shape [sample_num, feature_dim].

For example, inputs = np.random.random([100, 200]) creates 100 samples, each a 200-dimensional vector.

  • k – the number of clusters. For example, k = 10 means the data will be clustered into 10 classes.

Here is an example.

from sklearn.cluster import KMeans
import numpy as np

# prepare data
sample_num = 200
feature_dim = 100
data = np.random.random([sample_num, feature_dim])

Here we create a sample data set containing 200 samples, each with 100 features.

k = 10
kmeans = KMeans(n_clusters=k, random_state=0, max_iter=500).fit(data)

This code clusters the data into 10 classes (k = 10), running at most 500 iterations.

print(kmeans.labels_)
print(kmeans.cluster_centers_.shape)

These two lines are important: kmeans.labels_ gives the class label of each sample, and kmeans.cluster_centers_ gives the cluster centers.

Run this code, we may see (the data is random, so your labels will differ):

[1 5 1 5 7 2 7 5 1 7 0 3 6 2 9 0 1 6 0 2 4 1 5 8 4 2 0 5 5 7 6 8 2 2 4 5 5
 5 1 5 6 5 0 2 1 8 5 6 5 1 8 8 1 1 3 9 2 7 6 5 5 6 2 5 5 4 6 6 0 7 1 2 7 2
 5 2 9 7 8 6 7 5 9 9 5 7 9 7 2 3 3 1 4 5 1 2 5 4 9 8 0 6 8 7 5 6 3 0 5 0 7
 5 7 8 2 3 8 1 2 4 0 7 7 2 2 0 5 0 8 2 7 4 9 9 0 9 5 7 9 9 5 9 7 2 8 4 4 8
 4 0 5 3 9 3 8 8 8 2 4 2 4 2 7 3 6 9 0 6 9 5 8 0 0 0 5 9 8 0 2 5 8 6 4 0 7
 5 7 5 5 7 5 9 5 8 1 1 5 6 5 8]
(10, 100)

We can see the class labels range from 0 to 9. The shape of the cluster centers is [k, feature_dim] = [10, 100].
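To double-check this, here is a small self-contained sketch (repeating the setup above) that verifies the label range and the shape of the cluster centers:

```python
from sklearn.cluster import KMeans
import numpy as np

# repeat the setup from the tutorial
sample_num, feature_dim, k = 200, 100, 10
data = np.random.random([sample_num, feature_dim])
kmeans = KMeans(n_clusters=k, random_state=0, max_iter=500).fit(data)

# the labels are integers in [0, k-1], one per sample
print(np.unique(kmeans.labels_))      # [0 1 2 3 4 5 6 7 8 9]
# one center per cluster, in the original feature space
print(kmeans.cluster_centers_.shape)  # (10, 100)
```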

Finally, we can use this fitted k-means model to predict the class labels of a test data set.

test_sample = np.random.random([5, feature_dim])
print(kmeans.predict(test_sample))

Run this code, we may see:

[2 9 6 6 5]
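Under the hood, predict() assigns each test sample to its nearest cluster center. As a self-contained sketch (repeating the setup above), we can reproduce its result manually with numpy by computing Euclidean distances to every center:

```python
from sklearn.cluster import KMeans
import numpy as np

# repeat the setup from the tutorial
feature_dim = 100
data = np.random.random([200, feature_dim])
kmeans = KMeans(n_clusters=10, random_state=0, max_iter=500).fit(data)

test_sample = np.random.random([5, feature_dim])
pred = kmeans.predict(test_sample)

# compute Euclidean distances from each test sample to each center,
# then take the index of the closest center
dists = np.linalg.norm(
    test_sample[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=2
)
manual = dists.argmin(axis=1)

print(np.array_equal(pred, manual))  # True
```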