Examples of Document Clustering

Consider two documents with term vectors:

Document 1: [2, 1, 0]
Document 2: [0, 1, 2]

Initial centroids:

Centroid 1 = [1, 0, 1]
Centroid 2 = [0, 0, 2]

Calculate distances, assign documents to clusters, and update centroids iteratively until convergence.

Iteration 1:

Assignment:

d(Document 1, Centroid 1) = √((2-1)² + (1-0)² + (0-1)²) = √3 ≈ 1.73
d(Document 1, Centroid 2) = √((2-0)² + (1-0)² + (0-2)²) = √9 = 3

Document 1 is closer to Centroid 1, so Cluster(Document 1) = Cluster 1.

d(Document 2, Centroid 1) = √((0-1)² + (1-0)² + (2-1)²) = √3 ≈ 1.73
d(Document 2, Centroid 2) = √((0-0)² + (1-0)² + (2-2)²) = √1 = 1

Document 2 is closer to Centroid 2, so Cluster(Document 2) = Cluster 2.

Update Centroids:

Centroid 1 = (1/|Cluster 1|) Σ_{x ∈ Cluster 1} x = [2, 1, 0]
Centroid 2 = (1/|Cluster 2|) Σ_{x ∈ Cluster 2} x = [0, 1, 2]

Iteration 2:

Assignment:

d(Document 1, Centroid 1) = √((2-2)² + (1-1)² + (0-0)²) = 0, so Document 1 stays in Cluster 1.
d(Document 2, Centroid 2) = √((0-0)² + (1-1)² + (2-2)²) = 0, so Document 2 stays in Cluster 2.

Update Centroids:

Centroid 1 = (1/|Cluster 1|) Σ_{x ∈ Cluster 1} x = [2, 1, 0]
Centroid 2 = (1/|Cluster 2|) Σ_{x ∈ Cluster 2} x = [0, 1, 2]

Convergence:

The centroids have not changed between iterations, indicating convergence. The final clusters are:

Cluster 1 = {Document 1}, Cluster 2 = {Document 2}

In this simplified example, we have two clusters with one document in each cluster. In practice, K-means is often applied to larger document sets and with more sophisticated initialization strategies to improve convergence and clustering quality.
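The same computation can be run with scikit-learn. The sketch below reproduces the two-document example by passing the worked example's initial centroids explicitly; normally you would keep the default init='k-means++', which is one such smarter initialization strategy:

```python
import numpy as np
from sklearn.cluster import KMeans

# Term vectors and initial centroids from the worked example above.
docs = np.array([[2, 1, 0],   # Document 1
                 [0, 1, 2]])  # Document 2
init_centroids = np.array([[1, 0, 1],   # Centroid 1
                           [0, 0, 2]])  # Centroid 2

# n_init=1 because the starting centroids are supplied explicitly.
km = KMeans(n_clusters=2, init=init_centroids, n_init=1).fit(docs)

print(km.labels_)           # [0 1]: Document 1 -> Cluster 1, Document 2 -> Cluster 2
print(km.cluster_centers_)  # [[2. 1. 0.], [0. 1. 2.]]
```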


K-means clustering is a partitioning method that divides a dataset into K distinct, non-overlapping subsets (clusters). Each data point belongs to the cluster with the nearest mean, and the mean is recalculated as the centroid of the points in the cluster. This process is iteratively repeated until convergence.

K-Means Algorithm:

  1. Initialization:

    • Randomly select K initial centroids.
  2. Assignment Step:

    • Assign each data point to the nearest centroid, forming K clusters.

    Cluster(x_i) = argmin_j ‖x_i − μ_j‖²

    • where x_i is a data point, μ_j is the centroid of cluster j, and ‖·‖ denotes the Euclidean distance.
  3. Update Step:

    • Update the centroids by calculating the mean of all data points in each cluster.

    μ_j = (1/|Cluster(j)|) Σ_{x_i ∈ Cluster(j)} x_i

  4. Repeat Assignment and Update Steps:

    • Iteratively repeat the assignment and update steps until convergence (when centroids do not change significantly or a specified number of iterations is reached).
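
As a companion to these steps, here is a minimal from-scratch sketch in NumPy (the function and variable names are my own, and it assumes no cluster ever becomes empty):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # 2. Assignment: squared Euclidean distance of every point to every
        #    centroid, shape (n_points, k); each point takes the nearest one.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: each centroid becomes the mean of its assigned points
        #    (assumes every cluster keeps at least one point).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```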

Example:

Let's consider a simple example with a table of data points:

Data Point   Feature 1   Feature 2
A            1           2
B            2           3
C            2           2
D            3           3
E            8           7
F            9           8
G            10          7

Initialization:

  • Choose K = 2 and randomly select initial centroids: μ_1 = (2, 2), μ_2 = (8, 7).

Iteration 1:

  • Assignment Step:

    • Assign each point to the nearest centroid.
      • Cluster 1: {A, B, C, D}
      • Cluster 2: {E, F, G}
  • Update Step:

    • Recalculate centroids.
      • μ_1 = (1/4)(1+2+2+3, 2+3+2+3) = (2, 2.5)
      • μ_2 = (1/3)(8+9+10, 7+8+7) = (9, 7.333)

Iteration 2:

  • Repeat the assignment and update steps. With μ_1 = (2, 2.5) and μ_2 = (9, 7.333), every point is still assigned to the same cluster, so the centroids do not change.

Convergence:

  • The centroids did not change in iteration 2, so the algorithm has converged: Cluster 1 = {A, B, C, D} and Cluster 2 = {E, F, G}.

In practice, you may use Python libraries like scikit-learn to apply the K-means algorithm efficiently.
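
For reference, a short scikit-learn sketch of the table example above (starting from the same initial centroids, so the comments match the hand calculation):

```python
import numpy as np
from sklearn.cluster import KMeans

# Data points A-G from the table above.
X = np.array([[1, 2], [2, 3], [2, 2], [3, 3],
              [8, 7], [9, 8], [10, 7]])

km = KMeans(n_clusters=2, init=np.array([[2, 2], [8, 7]]), n_init=1).fit(X)

print(km.labels_)           # [0 0 0 0 1 1 1]: {A, B, C, D} and {E, F, G}
print(km.cluster_centers_)  # [[2.  2.5], [9.  7.333...]]
```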

Hierarchical Clustering:

Hierarchical Clustering is a method of cluster analysis that builds a hierarchy of clusters. It can be visualized using a tree-like diagram called a dendrogram. There are two main types of hierarchical clustering: Agglomerative (bottom-up) and Divisive (top-down).

Agglomerative Hierarchical Clustering Algorithm:

  1. Initialization:

    • Start with each data point as a separate cluster.
  2. Pairwise Similarity Calculation:

    • Calculate the similarity (or distance) between each pair of clusters. The choice of similarity measure depends on the nature of the data (e.g., Euclidean distance, correlation).
  3. Merge Step:

    • Merge the two most similar clusters into a new cluster. Update the similarity matrix.
  4. Repeat Steps 2-3:

    • Repeat the pairwise similarity calculation and merge steps until only a single cluster remains.

Example:

Let's consider a simple example with a table of data points:

Data Point   Feature 1   Feature 2
A            1           2
B            2           3
C            2           2
D            3           3
E            8           7
F            9           8
G            10          7

Agglomerative Hierarchical Clustering Steps:

Step 1: Initialization:

  • Each data point is initially a separate cluster.

Step 2: Pairwise Similarity Calculation:

  • Calculate pairwise Euclidean distances between clusters.

Step 3: Merge Step:

  • Merge the two closest clusters.
    • The smallest pairwise distance is 1 (shared by A–C, B–C, and B–D); suppose we merge A and C into a new cluster AC.

Updated Table:

Cluster   Feature 1   Feature 2
AC        1.5         2
B         2           3
D         3           3
E         8           7
F         9           8
G         10          7

The merged cluster AC is represented by the centroid (mean) of A and C: ((1+2)/2, (2+2)/2) = (1.5, 2).

  • Repeat steps 2-3 until only one cluster remains.

Step 2: Pairwise Similarity Calculation (Updated Table):

  • Calculate pairwise Euclidean distances between clusters.

Step 3: Merge Step (Second Merge):

  • Merge the two closest clusters.
    • Using the centroid representation in the updated table, the smallest remaining distance is between B and D (distance 1), so they are merged into a new cluster BD.

Final Result:

  • Repeating the merge step until a single cluster remains yields a dendrogram that shows the order of the merges and the distances at which they occurred.

In practice, you may use Python libraries like scipy and scikit-learn to apply hierarchical clustering efficiently and visualize the results.
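
For instance, here is a minimal scipy sketch of the example above; method='centroid' matches the midpoint representation used for cluster AC (matplotlib is only needed for the plot):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Data points A-G from the table above.
X = np.array([[1, 2], [2, 3], [2, 2], [3, 3],
              [8, 7], [9, 8], [10, 7]])

# 'centroid' linkage represents each merged cluster by its centroid.
Z = linkage(X, method='centroid', metric='euclidean')

dendrogram(Z, labels=['A', 'B', 'C', 'D', 'E', 'F', 'G'])
plt.show()
```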

How are pairwise Euclidean distances between clusters calculated?

For two points (x1, y1) and (x2, y2), the Euclidean distance is √((x1 - x2)² + (y1 - y2)²). Once a cluster contains more than one point, a linkage criterion defines the cluster-to-cluster distance: single linkage uses the closest pair of points, complete linkage the farthest pair, average linkage the mean of all pairwise distances, and centroid linkage the distance between the cluster centroids (the convention used above when AC was represented by (1.5, 2)).

How are the two closest clusters merged?

Find the smallest entry in the current distance matrix, replace the two corresponding clusters with a single combined cluster, and recompute that cluster's distances to all remaining clusters under the chosen linkage. Each merge removes one row and column from the matrix, and the process stops when a single cluster remains.
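
The sketch below illustrates both steps for the example data, using scipy's pdist to build the distance matrix and then locating the closest pair (variable names are illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

labels = ['A', 'B', 'C', 'D', 'E', 'F', 'G']
X = np.array([[1, 2], [2, 3], [2, 2], [3, 3],
              [8, 7], [9, 8], [10, 7]])

# Pairwise Euclidean distances, expanded into a square matrix.
D = squareform(pdist(X, metric='euclidean'))

# Ignore the zero self-distances, then find the closest pair of clusters.
np.fill_diagonal(D, np.inf)
i, j = np.unravel_index(D.argmin(), D.shape)
print(f"Merge {labels[i]} and {labels[j]} (distance {D[i, j]:.2f})")
# -> Merge A and C (distance 1.00)
```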
