Jump to: navigation, search


The Region Clustering is an algorithm used to improve the clustering result based on the information from Pairwise Clustering(PWC) and Multidimensional Scaling. The reason to create this method for region define is because performance of PWC sometimes is ill defined. We want to improve the mega cluster definition for later sub-clustering. To do Region Clustering, we need to introduce the Barnes Hut tree--an octree to divide the points in target dimension space into different tree nodes (regions). This method was first used in nbody simulation. Here is a picture from 3D space for an example of 3D barnes hut tree:

BH Tree Sample
BH Tree Sample


Now inspired by this method, we apply it into region clustering.

  1. Produce a set i=1..M of “mega-clusters”
  2. Perform an Oct-Tree analysis of data with a cut off such that number of terminal nodes is in range D.
  3. All terminal nodes are assigned to one of 3 categories
    1. Belong to a node cluster i = 1.. M
    2. Unclear mixture i = U
    3. In the intergalactic void i = V
  4. Let f be a fraction as threshold
  5. Iterate Following
    1. Loop over terminal nodes
    2. Assign category to each terminal node
      1. If NOT lowest level assign to V
      2. If lowest level , look at node cluster assignments of points in node. If a fraction greater than f belong to cluster i, assign to i.
      3. If lowest level and no node cluster has fraction f, assign terminal node to U.
    3. Find the center sequence P in each nodes to represent the node
    4. Assign sequences to node cluster whose center they are nearest. Only do this this for sequences in lowest level nodes.
    5. Go to step 5
  6. Finally assign all sequences and nodes to node cluster they are nearest.

By applying region clustering, our mega cluster will be much clearer as shown in the following figures. And this method is extremely fast since it uses the octree to avoid iterate through all the points like in k-means algorithm.

100k data before region clustering 100k data after region clustering with trees visualized
100k data before region clustering 100k data after region clustering with trees visualized


There are many possible extension for this algorithm. We do not necessarily need the PWC result since we can manually assign the cluster number and cluster center for some visually cluster at the beginning and then apply the region clustering algorithm on that to improve it. Another extension on this algorithm is to add nodes cluster number changing method into it to avoid the edges mixture of several different colors while it is a visually a small cluster.