Wednesday, 7 May 2014

Playing with Suliman's proteomic data.....

I have decided to start simply in R.
I have opened a new project. This clears all the data from the memory which looks like a clean start.

Steps:

  1. Get the data in..  NP40 = read.csv("...")
    1. First time I did this, I used read.txt and the data was in the wrong format!
  2. Look at the data.. head (NP40)
  3. Attach(NP40) - then I can use the headings. This is faster than using $ all the time. 
  4. Play a bit - plot(Pat_1~Pat_2)
  5. Try default distance measures dist(rbind(Pat_1, Pat_2, Pat_3....)
  6. rbind combines the data by rows and is required for the dist function
  7. Put these in a matrix themselves distances <- dist(rbind(Pat_1, Pat_2....)
  8. Do hierarchy clustering hc <- hclust(distances) and then plot(hc)
So this all worked and made a cluster and gave some distances. 


Distances:

           Pat_1     Pat_2     Pat_3     Pat_4     Pat_5     Pat_6     Pat_7     Pat_8     Pat_9    Pat_10

Pat_2   7.537715                                                                                          
Pat_3   3.875914  6.809347                                                                                
Pat_4   5.263328  8.908621  4.036732                                                                      
Pat_5   6.415988  7.497405  5.970487  6.052602                                                            
Pat_6   5.335291  7.945542  5.344391  5.737485  5.301845                                                  
Pat_7   9.638823 11.860930  8.369786  8.654071 12.813030 13.373229                                        
Pat_8  10.546800  7.906824 10.610556 11.005512 10.942666 11.187895 12.030716                              
Pat_9   5.540512  6.480492  5.825622  7.233939  5.790468  4.195550 10.479076  9.158402                    
Pat_10  4.815430  8.433853  5.014136  7.109553  5.986761  4.804471 12.413832 11.686438  5.207372          
Pat_11  4.632951  8.585297  5.024850  6.874530  5.928018  4.487083 12.188807 11.433096  4.944108  2.930767

The distances vary from 2.93 Patient 10 vs Patient 11 to 13.4 Patient 6 vs Patient 7.

> summary(distances)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.931   5.319   6.875   7.603  10.060  13.370 

If we plot these two it generates interesting plots:


Plot of closest Patient samples (Pat_10 vs Pat_11)
r-squared = 0.49, p<0.0001



 Plot of farthest patient samples (Pat_6 vs Pat_7)
r-squared = -0.002. p=0.43

I decided to plot the residuals of this to see if there are outliers in the data set. 
>largedist.res = resid(largedist)
This function resid calculates the residuals
> plot(largedist.res)
gives...

I think anything with a residual above 0.5 is probably interesting and may merit some investigation. Certainly, the two dots above 1.0 merit a look....

I wanted to compare a third set of data that used different patients samples but still had a large distance. I chose to
>plot(Pat4~Pat8)
which gave: 
r-squared=0.15, p>0.001

This is interesting as event though the distance is relatively large and the r-squared is quite small, the correlation is significant. 

Residual graph is here:

There maybe some interesting proteins here. 

Sadly, it is not immediately easy to work out which protein is which as we lose the protein IDs during the process. The plots don't require this. 

Interesting afternoon with lots to think about. 
Some nice productive playing with R. 

Some more questions come to mind:
  • Will the SDS data generate the same hierarchy cluster as the NP40 data?
  • Is there a way to evaluate distance? Do the values mean anything? I guess you can average this and then divide by the average. This might give some sort of outlier information...
That's enough for today. 





No comments:

Post a Comment