## Introduction

This is a re-run of the previous fungi data with improved algorithms. Overall steps, however, remained the same except for the following additions.

- Fixed points multi-dimensional scaling (MDS) plots (read more)
- Full scale collage plot from region MDSs (read more)
- Centers for each cluster (read more)

The following is an image of the new interpolation results after region refinements. Note the introduction of a new seventh mega region.

A snapshot of the full scale collage constructed out of MDSs of each region is given below.

## Summary of Differences

The following table summarizes the change in number of points per each mega region. More details on this is available here.

Region | In Old and New | In Old NOT New | In New NOT Old | Old Total | New Total |
---|---|---|---|---|---|

0 | 109158 | 2071 | 16640 | 111229 | 125798 |

1 | 40634 | 9601 | 6405 | 50235 | 47039 |

2 | 112163 | 2460 | 12046 | 114623 | 124209 |

3 | 34480 | 7694 | 4732 | 42174 | 39212 |

4 | 34315 | 39570 | 446 | 73885 | 34761 |

5 | 36399 | 2044 | 5471 | 38443 | 41870 |

6 | 12967 | 2485 | 9745 | 15452 | 22712 |

7 | 0 | 0 | 10440 | 0 | 10440 |

## Data Set

The initial data set was received from Dr. Haixu Tang in Indiana University. The details are as follows.

- FastA file

-- allreads.fna - Total sequence count

-- 957387 - Unique sequence FastA file

-- allreads_uniques_482158.txt - Unique sequence count

-- 482158 - Unique sequences with lengths greater than 200 FastA file

-- allreads_uniques_gt200_446041_random.txt - Unique sequences with lengths greater than 200 count

-- 446041

## Process

- Pick 100K random sample from
unique sequences with lengths greater than 200

-- allreads_uniques_gt200_440641_random_100k_0.txt - The remaining is called out-sample sequences
- Run pairwise local sequence alignment (Smith-Waterman) on 1.
- Run DA(Deterministic Annealing)-SMACOF on 3.
- Run Deterministic Annealing pairwise clustering on 3.
- Produce plot from 4. and 5.
- Refine 6. for spatially compact Mega-regions
- Assign out-sample sequences in 2. to Mega-regions of 7. based on a nearest neighbour approach
- Extract sequences for each such Mega-region from full unique sequence set
- For each Mega-region sequence set
- Run pairwise local sequence alignment (Smith-Waterman)
- Run DA(Deterministic Annealing)-SMACOF on 10.1
- Run MDSasChisq in SMACOF mode on 10.1 with sample points of the region fixed to locations from 4
- Run Deterministic Annealing pairwise clustering on 10.1
- Produce plot from 10.2 and 10.3
- While refinements necessary for 10.5
- Extract distances for necessary sub clusters
- Run Deterministic Annealing pairwise clustering on 10.5.1
- Merge results of 10.5.2 with 10.3
- Produce plot from 10.2 and 10.5.3
- Go to 10.5

- Produce final region specific plot from 10.2 and 10.6
- Produce final region specific fixed run plot from 10.3 and 10.6
- Classify clusters into three groups (see cluster status) for clusters in 10.7
- Find cluster centers for each clean cluster in MDSs of 10.7 and 10.8

## Fixed MDS Plots

Each mega region has a subset of sequences coming from the sample set (see process for details). We ran MDS for each region while fixing these sample sequences to a set of known positions. These known positions were determined by running MDS on sample sequences only.

## Full Scale Collage

Regions are a method used to break down the large computation necessary otherwise, but may not necessarily be related with a biological categorization. Thus, we wanted a mechanism to produce a single plot containing points from all regions, so biologists may have a better view at the sequences as a whole. The solution was to use the fixed points MDS plots as shown above.

## Cluster Centers

Once we are satisfied with the clustering results, we found sequences to represent each of them, which we
denote as cluster centers. Three methods were used in finding centers resulting three center sequences per
each cluster. Later, each of these centers were evaluated based on the position in the cluster (of fixed
points plots), thus keeping only the sequence that best appear to represent the cluster. If more than one
best centers were found we picked the one with the longest sequence length.
See Refined Cluster Centers page for more information

Description of the three methods is as follows.

- Sequence Mean

-- Given a cluster this is the sequence that has the minimum mean distance to other points in the same cluster. - Euclidean Mean

-- Similar to sequence mean, this method also finds the sequence that has the minimum mean distance to other points in a cluster. However, the distances are taken from the three dimensional Euclidean space rather than from distances computed from alignments. - Centroid of Cluster

-- This method finds the centroid point of the cluster in the Euclidean space. Then, in the same space, it finds the point nearest to the centroid. the sequence represented by this point is taken as one of the centers for the particular cluster.

## Dependence of Results on Sequence Length

An extensive study of length dependency was done with previous run of this data, hence we avoided redoing it in this run as it was the same set of sequences and we expected similar results. Details of the previous analysis is available here.

## Cluster Status

We classified clusters into 3 different types

Cluster Status | Meaning |
---|---|

1 | Good Clean Cluster |

2 | Clean Cluster but some refinement could be useful |

3 | Debris |

Note "clustering program" puts all sequences into 1 and only 1 cluster. So sequences scattered in background are put in clusters. Thus category 3 label "clusters" which are really scattered sequences filling void between clusters.

Note some debris sequences are tails to "real" clusters.

Category 2 correspond to cases where further refinement could be useful
-- especially with increased statistics.

## References

## Technologies Used

- Twister
- MPI.NET
- Dimension Reduction with Deterministic Annealing SMACOF
- Dimension Reduction by Interpolation
- Clustering by Deterministic Annealing Pairwise Techniques
- Smith Waterman Gotoh Distance Computation
- .NET Bio (formerly Microsoft Biology Foundation)

Work supported in part by the National Science Foundation under Grant No. 0910812 to Indiana University for "FutureGrid: An Experimental, High-Performance Grid Test-bed." Partners in the FutureGrid project include U. Chicago, U. Florida, San Diego Supercomputer Center - UC San Diego, U. Southern California, U. Texas at Austin, U. Tennessee at Knoxville, U. of Virginia, Purdue I., and T-U. Dresden. Work supported in part by Microsoft Research

This is a SALSAHPC project