Visualisation and Data: Playing with Suliman's proteomic data.....

I have decided to start simply in R.
I have opened a new project. This clears all the data from the memory which looks like a clean start.

Steps:

Get the data in.. NP40 = read.csv("...")

First time I did this, I used read.txt and the data was in the wrong format!

Look at the data.. head (NP40)
Attach(NP40) - then I can use the headings. This is faster than using $ all the time.
Play a bit - plot(Pat_1~Pat_2)
Try default distance measures dist(rbind(Pat_1, Pat_2, Pat_3....)
rbind combines the data by rows and is required for the dist function
Put these in a matrix themselves distances <- dist(rbind(Pat_1, Pat_2....)
Do hierarchy clustering hc <- hclust(distances) and then plot(hc)

So this all worked and made a cluster and gave some distances.

Distances:

Pat_1 Pat_2 Pat_3 Pat_4 Pat_5 Pat_6 Pat_7 Pat_8 Pat_9 Pat_10

Pat_2 7.537715

Pat_3 3.875914 6.809347

Pat_4 5.263328 8.908621 4.036732

Pat_5 6.415988 7.497405 5.970487 6.052602

Pat_6 5.335291 7.945542 5.344391 5.737485 5.301845

Pat_7 9.638823 11.860930 8.369786 8.654071 12.813030 13.373229

Pat_8 10.546800 7.906824 10.610556 11.005512 10.942666 11.187895 12.030716

Pat_9 5.540512 6.480492 5.825622 7.233939 5.790468 4.195550 10.479076 9.158402

Pat_10 4.815430 8.433853 5.014136 7.109553 5.986761 4.804471 12.413832 11.686438 5.207372

Pat_11 4.632951 8.585297 5.024850 6.874530 5.928018 4.487083 12.188807 11.433096 4.944108 2.930767

The distances vary from 2.93 Patient 10 vs Patient 11 to 13.4 Patient 6 vs Patient 7.

> summary(distances)

Min. 1st Qu. Median Mean 3rd Qu. Max.

2.931 5.319 6.875 7.603 10.060 13.370

If we plot these two it generates interesting plots:

Plot of closest Patient samples (Pat_10 vs Pat_11)

r-squared = 0.49, p<0.0001

Plot of farthest patient samples (Pat_6 vs Pat_7)

r-squared = -0.002. p=0.43

I decided to plot the residuals of this to see if there are outliers in the data set.

>largedist.res = resid(largedist)

This function resid calculates the residuals

> plot(largedist.res)

gives...

I think anything with a residual above 0.5 is probably interesting and may merit some investigation. Certainly, the two dots above 1.0 merit a look....

I wanted to compare a third set of data that used different patients samples but still had a large distance. I chose to

>plot(Pat4~Pat8)

which gave:

r-squared=0.15, p>0.001

This is interesting as event though the distance is relatively large and the r-squared is quite small, the correlation is significant.

Residual graph is here:

There maybe some interesting proteins here.

Sadly, it is not immediately easy to work out which protein is which as we lose the protein IDs during the process. The plots don't require this.

Interesting afternoon with lots to think about.

Some nice productive playing with R.

Some more questions come to mind:

Will the SDS data generate the same hierarchy cluster as the NP40 data?
Is there a way to evaluate distance? Do the values mean anything? I guess you can average this and then divide by the average. This might give some sort of outlier information...

That's enough for today.

Visualisation and Data

Wednesday, 7 May 2014

Playing with Suliman's proteomic data.....

No comments:

Post a Comment