Wednesday, 7 May 2014

Hierarchical clustering...

For Suliman's manuscript, we have been asked to do some more statistical analysis. The reviewer particularly recommended hierarchical clustering.

Thinking about it and learning a little bit more about what is involved suggests some possible questions:

  • Can we cluster our patients based on proteomic data?
  • Which proteins determine the hierarchies?

Some issues about normalisation also come to mind.

An interesting PDF is located at www.microarrays.ca/services/hierarchical_clustering.pdf. It says that "the idea of this method is to build a hierarchy of clusters, showing relations between the individual members and merging clusters of data based on similarity."

A key concept is a "distance metric", which is a measure of how similar (or dissimilar) two items are. There are different measures to choose from. Two common ones are Euclidean distance and Pearson correlation. Euclidean distance looks at just the numbers (how far apart the values are), while Pearson correlation looks more at trends (whether two profiles rise and fall together). The two can give very different clustering patterns.
Other measures of distance include: maximum, Manhattan, Canberra, binary and Minkowski.
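
To make the "numbers versus trends" point concrete for myself, here is a minimal sketch in Python (not R, and using only the standard library). The protein profiles are made-up numbers, not our data, and the helper names `euclidean` and `pearson_distance` are my own. Turning the correlation into a distance as 1 minus the Pearson correlation is a common convention that I am assuming here, not something taken from the PDF.

```python
import math

def euclidean(x, y):
    """Square root of the sum of squared differences between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_distance(x, y):
    """1 - Pearson correlation (assumed convention): near 0 when two
    profiles rise and fall together, near 2 when they move in opposite
    directions."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1 - cov / (sd_x * sd_y)

# Hypothetical abundance profiles across four samples (made-up numbers).
protein_a = [1.0, 2.0, 3.0, 4.0]
protein_b = [10.0, 20.0, 30.0, 40.0]  # same trend as A, ten times larger
protein_c = [4.0, 3.0, 2.0, 1.0]      # same magnitude as A, opposite trend

for name, other in [("B", protein_b), ("C", protein_c)]:
    print("A vs", name,
          "| Euclidean:", round(euclidean(protein_a, other), 2),
          "| Pearson distance:", round(pearson_distance(protein_a, other), 2))
```

With these numbers, Euclidean distance puts A much closer to C than to B, while the Pearson-based distance puts A right next to B and as far as possible from C. The clustering would come out differently depending on which one is used.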

"For more gene expression experiments you will likely find Pearson correlations to be more appropriate."


This website here talks about using R for hierarchical clustering.
It talks about various linkage methods: single, median, average, centroid, Ward's, McQuitty's.
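
As far as I understand it, the linkage method decides how the distance between two clusters (rather than two single items) is measured when choosing which clusters to merge next. Below is a rough sketch in plain Python of bottom-up clustering with just two of the methods listed, single and average linkage. It is a toy illustration under my own assumptions, not what R does internally: the function names are hypothetical, the "patients" are made-up points, and Euclidean distance is assumed throughout.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cluster_distance(c1, c2, points, linkage):
    """Distance between two clusters, each a list of point indices."""
    pair_dists = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "single":    # distance between the closest pair of members
        return min(pair_dists)
    if linkage == "average":   # mean distance over all pairs of members
        return sum(pair_dists) / len(pair_dists)
    raise ValueError("unknown linkage: " + linkage)

def agglomerate(points, linkage="average"):
    """Start with one cluster per point and repeatedly merge the two
    closest clusters. Returns the merge history as
    (left cluster, right cluster, distance) tuples."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], points, linkage)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical patients, each described by two made-up protein measurements.
patients = [(1.0, 1.0), (1.5, 1.0), (5.0, 5.0), (5.0, 6.0)]

for method in ("single", "average"):
    print(method, "linkage:")
    for left, right, dist in agglomerate(patients, method):
        print("  merge", left, "+", right, "at distance", round(dist, 2))
```

The merge history is what a dendrogram draws: each merge is a join in the tree, and the distance is the height of that join. With this toy data both methods merge the same pairs first, but the height of the final join differs between them.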

Euclidean distance: the square root of the sum of squares of the attribute differences.
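
Written out as a formula, with a small worked example. The symbols x, y and n are my own notation for two items with n attributes each, and the numbers are made up:

```latex
d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

% Worked example with x = (1, 2, 3) and y = (4, 6, 3):
d(x, y) = \sqrt{(1-4)^2 + (2-6)^2 + (3-3)^2}
        = \sqrt{9 + 16 + 0}
        = 5
```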

This is going to be a learning curve!!!



