Identifying dissimilarity is critical to discovering hidden structure in large sets of gene (or genome Alu) sequences. PDC is particularly suited to processing gene data because it preserves distance ratios (that is, it is structure-preserving) when reducing from a high-dimensional to a low-dimensional Euclidean space. When used with the Deterministic Annealing approach, PDC gains the robustness of maximum entropy inference. We parallelize the algorithm in C# to run on multicore clusters using CCR and MPI. Our preliminary MDS results show 3D distributions of 9000 Alu genome sequences.
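To make the idea of annealing a pairwise clustering concrete, here is a minimal serial sketch in Python. It is not the parallel C# implementation described above. It assumes a mean-field update of the kind commonly used for pairwise clustering, in which each point's cost for a cluster is its average dissimilarity to that cluster minus half the cluster's average internal dissimilarity. The function names, the cooling schedule, the number of sweeps, and the random initialization are all illustrative assumptions.

```python
import math
import random


def _sweep(dist, m, t):
    """One mean-field update of the soft assignments m at temperature t."""
    n, k = len(m), len(m[0])
    # Soft size of each cluster.
    size = [sum(m[i][c] for i in range(n)) for c in range(k)]
    # b[i][c]: average dissimilarity between point i and cluster c.
    b = [[sum(dist[i][j] * m[j][c] for j in range(n)) / size[c]
          for c in range(k)] for i in range(n)]
    # a[c]: minus half the average dissimilarity inside cluster c.
    a = [-0.5 * sum(m[i][c] * b[i][c] for i in range(n)) / size[c]
         for c in range(k)]
    new_m = []
    for i in range(n):
        cost = [b[i][c] + a[c] for c in range(k)]
        low = min(cost)  # shift costs so the largest weight is exactly 1
        weights = [math.exp(-(e - low) / t) for e in cost]
        total = sum(weights)
        new_m.append([w / total for w in weights])
    return new_m


def anneal_pairwise(dist, n_clusters, t_start, t_end,
                    cooling=0.9, sweeps=20, seed=0):
    """Return soft cluster memberships computed from a dissimilarity matrix.

    dist is a symmetric n x n list of lists. The temperature is lowered
    geometrically from t_start until it drops to t_end or below.
    """
    rng = random.Random(seed)
    # Start near uniform, with a small perturbation to break symmetry.
    m = []
    for _ in range(len(dist)):
        row = [1.0 + 0.01 * rng.random() for _ in range(n_clusters)]
        total = sum(row)
        m.append([v / total for v in row])
    t = t_start
    while t > t_end:
        for _ in range(sweeps):
            m = _sweep(dist, m, t)
        t *= cooling
    return m


if __name__ == "__main__":
    # Toy data: two well-separated groups on a line, squared differences.
    xs = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
    dist = [[(p - q) ** 2 for q in xs] for p in xs]
    memberships = anneal_pairwise(dist, n_clusters=2, t_start=10.0, t_end=0.1)
    print([max(range(2), key=lambda c: row[c]) for row in memberships])
```

The memberships harden as the temperature falls, so taking the largest entry in each row gives a hard assignment. Only the dissimilarity matrix is needed, never the coordinates of the points, which is what makes this style of clustering applicable to sequence data.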

  Deterministic Annealing for Pairwise Clustering

One class of mobile elements in the genome is called "Alu". The purpose of this classification is to identify subfamilies and analyze hidden structure.
  • "chr3" is the chromosome name
  • "4579922" is the start position and "4580207" is the end position
  • "+"/"C" is the strand ("+" is the plus strand and "C" is the minus strand, since genome sequences are double-stranded)
  • "AluJb" is the family name. All the sequences in this data set are AluJb.
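The fields listed above can be collected into a small record type. The sketch below is illustrative only: the exact label layout used in the data set is not shown here, so the whitespace-separated form "chromosome start end strand family" is an assumption, as are the names `AluLabel` and `parse_alu_label`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AluLabel:
    chromosome: str  # e.g. "chr3"
    start: int       # start position
    end: int         # end position
    strand: str      # "+" for the plus strand, "C" for the minus strand
    family: str      # e.g. "AluJb"


def parse_alu_label(label: str) -> AluLabel:
    """Parse a label in the ASSUMED layout 'chromosome start end strand family'."""
    parts = label.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {label!r}")
    chromosome, start, end, strand, family = parts
    if strand not in ("+", "C"):
        raise ValueError(f"unknown strand {strand!r}")
    return AluLabel(chromosome, int(start), int(end), strand, family)


if __name__ == "__main__":
    print(parse_alu_label("chr3 4579922 4580207 + AluJb"))
```

If the real labels use a different separator or field order, only the splitting step in `parse_alu_label` would need to change.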
  Complete decomposition of 3000 Alu sequences into clusters

4500 points: Pairwise Annealing with distances determined pairwise

  Same Alu sequences, with all sequences aligned using Clustal W (multiple alignment)

  • Distances scaled before visualization to correspond to an effective dimension of 4
  • Original effective dimension: 20