Tuesday 28 July 2015

Draw a word cloud from the abstracts...

This bit of R script seems to work:

# make a word cloud of these abstracts
install.packages("tm")
install.packages("wordcloud")
library("tm")
library("wordcloud")

# combine the abs object into one string of words using the paste() function
# I've done this with a loop but I'm sure I could do it with an apply function
for (i in  1:length(abs)){
  abs.1 <- paste(abs.1, abs[i])
}

wordcloud(abs.1,
          max.words=100,
          random.order=FALSE, 
          rot.per=0.35, 
          use.r.layout=FALSE, 
          colors=brewer.pal(8, "Dark2"))

Here is the output:
It's pretty much what I expect so that's good news.
There are some words that could be removed but generally it's a good start, I think.

This method could be applied to a larger data set, I think.
IMPORTANTLY there are warnings. Some words not added because they 'won't fit on the page'. Need to be careful here.

I think it will be worthwhile to look at the frequency of the words.

Thursday 16 July 2015

download abstracts

I am using the package 'RISmed'.


Key steps:

Get the Pubmed IDs. 

This requires the function EUtilsSummary().
Here is the code to find Pubmed IDs for publications by C. Pepper over the last five years:

res <- EUtilsSummary("Pepper_C[au]", type='esearch', db='pubmed', mindate=2010, maxdate=2015)

This creates the object res which is a list of Pubmed IDs.
It called a "Formal class EUtilsSummary".
To find out a bit more about the object the following commands are useful:

> str(res)
Formal class 'EUtilsSummary' [package "RISmed"] with 6 slots
  ..@ db              : chr "pubmed"
  ..@ count           : num 77
  ..@ retmax          : num 77
  ..@ retstart        : num 0
  ..@ PMID            : chr [1:77] "25921075" "25897547" "25806611" "25752197" ...
  ..@ querytranslation: chr "Pepper_C[au] AND 2010[EDAT] : 2015[EDAT]"
> mode(res)
[1] "S4"
> summary(res)
Query:
Pepper_C[au] AND 2010[EDAT] : 2015[EDAT] 

Result count:  77


So usefully this tells us about the search we have done and the number of results. The results are in the @PMID part of the object. The object is of class S4. 

You can extract the individual PMIDs like this:
> res@PMID[1]
[1] "25921075"
> res@PMID[2]
[1] "25897547"
> res@PMID[3]
[1] "25806611"

Useful!!!

Downloading the abstracts


This res object can be used to download the abstracts using the EUtilsGet() function.

fetch <- EUtilsGet(res)

This returns an object called fetch
This is called "Large Medline (618.6 Kb)"
It contains lots of data including the abstracts. 

Extracting the abstracts

The abstracts can be extracted with the function AbstractText()
abs <- AbstractText(fetch)
Now we have 77 abstracts in a vector (I think).

> str(abs)
 chr [1:77] "OBJECTIVES: Ganglioneuroblastomas represent a histological subgroup of the rare neuroblastic tumours with intermediate malignan"| __truncated__ ...

> mode(abs)
[1] "character"

> abs[1]
[1] "OBJECTIVES: Ganglioneuroblastomas represent a histological subgroup of the rare neuroblastic tumours with intermediate malignant potential arising from neural crest progenitor cells of sympathetic nerves. Diagnosis can often be difficult based on imaging alone. We describe 4 cases of children presenting with a solitary neck mass with histology ultimately revealing ganglioneuroblastoma.METHODS: A retrospective case note review was carried out of all patients with cervical ganglioneuroblastoma seen at Great Ormond Street Hospital, UK.RESULTS: Mean age at presentation was 5 years. Based on imaging, the initial diagnoses for three of the cases were: lymphatic malformation, carotid body tumour, paraganglioma, respectively, whilst the remaining case had an immediate incisional biopsy revealing the correct diagnosis. All cases were managed by surgical excision with no evidence of recurrence after a median follow up of 6 years.CONCLUSION: Otolaryngologists should be aware of ganglioneuroblastoma when establishing the differential diagnosis of a child presenting with a neck mass. Biopsy is recommended as the gold standard investigation to avoid an incorrect diagnosis."
> 

Great. Good so far....



plotting the number of citations....

Here's what I did to find the number of papers published about chronic lymphocytic leukeamia.

# do search in pubmed on the web
# pubmed - "chronic lymphocytic leukaemia" or "chronic lymphocytic leukemia" or "CLL"
# get an option to download a csv file

The csv file needs to have the top line removed. I did this by opening in Excel and saving it again.

setwd("/Users/paulbrennan/Dropbox/R for Biochemists/PubMed")
data <- read.csv("timeline2.csv")
sum(data[2])
# 17,517 publications


library(ggplot2)

# draw a graph
ggplot(data, aes(year, count)) +
  geom_line() +
  ylab("Publications") + # y-label
  xlab("Year") + # x-label
  theme_bw()

Here is the output:




Tuesday 7 July 2015

Using R to count words....

In order to analyse a piece of text to determine Elective locations from our 2015 Electives Map, I have written a script to gather the different regions of the world.

The script uses the library(stringi), particularly the function stri_count().
I have also been using lapply() and sapply().

Here is the script:

library(stringi)

setwd("/Users/paulbrennan/Documents/RforBiochemistsScripts/mapping")
data <- read.csv("electiveLocations2015V2.csv", header=T)
country <- data$country
# so this represents 68 different countries

country.count <- sapply(uniq.country, function (x) sum(stri_count(country, fixed=x)))

# set up the countries within each region
pac.is <- c("Solomon Islands", "Vanuatu", "Fiji", 
            "Tonga", "Samoa", "Cook Islands")
eur <- c("England", "Wales", "Scotland", "Northern Ireland",
         "Switzerland", "Italy", "Malta", "Cyprus", "Denmark", "Sweden")
n.am <- c("USA", "Mexico", "Canada", "United Stated of America")
c.am <- c("Panama", "Belize", "Guatamala", 
          "St Vincent", "Antigua", "Britis Virgin Islands", 
          "Cuba", "The Bahamas", "Tobago")
s.am <- c("Brazil", "Peru", "Ecuador")
e.afr <- c("Tanzania", "Kenya", "Uganda", "Rwanda",
           "Burundi", "Djibouti", "Eritrea", "Ethiopia", "Somalia",
           "Comoros", "Mauritius", "Seychelles", "RĂ©union", "Mayotte",
           "Mozambique", "Madagascar", "Malawi", "Zambia", "Zimbabwe",
           "Egypt", "Sudan", "South Sudan")
se.asia <- c("Cambodia", "Malaysia", "Thailand")

# some of the countries are fine by themselves
w.afr  <- c("Ghana")
s.afr <- c("South Africa")
nz <- c("New Zealand")
auz <- c("Australia")
nepal <- c("Nepal")
india <- c("India")
japan <- c("Japan")
china <- c("China")

# create a list of all the locations
locs <- list(pac.is, eur, 
         n.am, c.am, s.am, 
         w.afr, s.afr, e.afr,
         nz, auz, 
         nepal, india, se.asia,
         japan, china)

# use lapply to cycle through the whole list
# and then sapply to cycle through the items in each list. 
loc.count <- lapply(locs, function (y) sum(sapply(y, function (x) sum(stri_count(country, fixed=x)))))

# this is a list of the aggregate locations (includes indiv countries sometimes)
loc.names <- c( "Pacific Islands", "Europe", 
                "North America", "Central America", "South America",
                "West Africa", "South Africa", "East Africa", 
                "New Zealand", "Australia",
                "Nepal", "India", "South East Asia", 
                "Japan", "China")

# create a data.frame - requires changing from a List unlist()
elec.tot.2015 <- as.data.frame(loc.names)
elec.tot.2015$count <- unlist(loc.count)  #  need to unlist the loc.count

# calculate radius, smallest value = 1 
# area of a circle pi.r^2 - give area = 1, sqrt(1/pi)
# add to the data.frame
elec.tot.2015$radius <- sqrt(elec.tot.2015$count/pi)

# list of positions put into batch geocoder
# http://www.findlatitudeandlongitude.com/batch-geocode/
# paste into object called pos
pos <- read.delim(pipe("pbpaste"), header=T)

elec.tot.2015 <- merge(elec.tot.2015, pos, by.x = "loc.names", by.y = "original.address")

# write the data.frame as a tsv
setwd("/Users/paulbrennan/Dropbox/Public")
write.table(elec.tot.2015, file = "electivestotal2015.tsv", sep = "\t")


I ended up opening this in Excel, deleting the column of row numbers and changing the suffix to .txt