I have opened a new project. This clears all the data from the memory which looks like a clean start.
Steps:
- Get the data in.. NP40 = read.csv("...")
- First time I did this, I used read.txt and the data was in the wrong format!
- Look at the data.. head (NP40)
- Attach(NP40) - then I can use the headings. This is faster than using $ all the time.
- Play a bit - plot(Pat_1~Pat_2)
- Try default distance measures dist(rbind(Pat_1, Pat_2, Pat_3....)
- rbind combines the data by rows and is required for the dist function
- Put these in a matrix themselves distances <- dist(rbind(Pat_1, Pat_2....)
- Do hierarchy clustering hc <- hclust(distances) and then plot(hc)
So this all worked and made a cluster and gave some distances.
Distances:
Pat_1 Pat_2 Pat_3 Pat_4 Pat_5 Pat_6 Pat_7 Pat_8 Pat_9 Pat_10
Pat_2 7.537715
Pat_3 3.875914 6.809347
Pat_4 5.263328 8.908621 4.036732
Pat_5 6.415988 7.497405 5.970487 6.052602
Pat_6 5.335291 7.945542 5.344391 5.737485 5.301845
Pat_7 9.638823 11.860930 8.369786 8.654071 12.813030 13.373229
Pat_8 10.546800 7.906824 10.610556 11.005512 10.942666 11.187895 12.030716
Pat_9 5.540512 6.480492 5.825622 7.233939 5.790468 4.195550 10.479076 9.158402
Pat_10 4.815430 8.433853 5.014136 7.109553 5.986761 4.804471 12.413832 11.686438 5.207372
Pat_11 4.632951 8.585297 5.024850 6.874530 5.928018 4.487083 12.188807 11.433096 4.944108 2.930767
The distances vary from 2.93 Patient 10 vs Patient 11 to 13.4 Patient 6 vs Patient 7.
> summary(distances)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.931 5.319 6.875 7.603 10.060 13.370
If we plot these two it generates interesting plots:
Plot of closest Patient samples (Pat_10 vs Pat_11)
r-squared = 0.49, p<0.0001
r-squared = -0.002. p=0.43
I decided to plot the residuals of this to see if there are outliers in the data set.
>largedist.res = resid(largedist)
This function resid calculates the residuals
> plot(largedist.res)
gives...
I think anything with a residual above 0.5 is probably interesting and may merit some investigation. Certainly, the two dots above 1.0 merit a look....
I wanted to compare a third set of data that used different patients samples but still had a large distance. I chose to
>plot(Pat4~Pat8)
which gave:
r-squared=0.15, p>0.001
This is interesting as event though the distance is relatively large and the r-squared is quite small, the correlation is significant.
Residual graph is here:
There maybe some interesting proteins here.
Sadly, it is not immediately easy to work out which protein is which as we lose the protein IDs during the process. The plots don't require this.
Interesting afternoon with lots to think about.
Some nice productive playing with R.
Some more questions come to mind:
- Will the SDS data generate the same hierarchy cluster as the NP40 data?
- Is there a way to evaluate distance? Do the values mean anything? I guess you can average this and then divide by the average. This might give some sort of outlier information...
That's enough for today.
No comments:
Post a Comment