Tuesday, 7 July 2015

Using R to count words....

To analyse a piece of text listing the elective locations for our 2015 Electives Map, I have written a script that counts the countries and gathers them into regions of the world.

The script uses the stringi package, loaded with library(stringi), particularly the function stri_count().
I have also been using lapply() and sapply().
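
To give a feel for how these functions fit together, here is a minimal sketch on a made-up vector. The names places and targets, and their values, are invented for illustration and are not from the electives data. stri_count() with fixed= counts literal occurrences of a pattern in each string, and sapply() repeats that for each pattern:

```r
library(stringi)

# made-up example data, not the real electives file
places <- c("Fiji", "Kenya", "Fiji", "Peru", "Kenya and Uganda")
targets <- c("Fiji", "Kenya", "Uganda")

# stri_count() returns one count per element of places
per.element <- stri_count(places, fixed = "Kenya")

# sapply() repeats the count for each target and sums across places
totals <- sapply(targets, function(x) sum(stri_count(places, fixed = x)))

print(per.element)
print(totals)
```

Because the match is on a fixed substring, an entry such as "Kenya and Uganda" is counted for both countries. The same behaviour means a short name is also counted inside a longer one that contains it (for example "Sudan" inside "South Sudan"), which is worth checking for when building lists of countries.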

Here is the script:

library(stringi)

setwd("/Users/paulbrennan/Documents/RforBiochemistsScripts/mapping")
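data <- read.csv("electiveLocations2015V2.csv", header = TRUE)
# as.character() in case the column was read in as a factor
country <- as.character(data$country)
uniq.country <- unique(country)
# so this represents 68 different countries

# count how many times each country appears
country.count <- sapply(uniq.country, function (x) sum(stri_count(country, fixed=x)))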

# set up the countries within each region
pac.is <- c("Solomon Islands", "Vanuatu", "Fiji", 
            "Tonga", "Samoa", "Cook Islands")
eur <- c("England", "Wales", "Scotland", "Northern Ireland",
         "Switzerland", "Italy", "Malta", "Cyprus", "Denmark", "Sweden")
n.am <- c("USA", "Mexico", "Canada", "United Stated of America")
c.am <- c("Panama", "Belize", "Guatamala", 
          "St Vincent", "Antigua", "Britis Virgin Islands", 
          "Cuba", "The Bahamas", "Tobago")
s.am <- c("Brazil", "Peru", "Ecuador")
e.afr <- c("Tanzania", "Kenya", "Uganda", "Rwanda",
           "Burundi", "Djibouti", "Eritrea", "Ethiopia", "Somalia",
           "Comoros", "Mauritius", "Seychelles", "Réunion", "Mayotte",
           "Mozambique", "Madagascar", "Malawi", "Zambia", "Zimbabwe",
           "Egypt", "Sudan", "South Sudan")
se.asia <- c("Cambodia", "Malaysia", "Thailand")

# some of the countries are fine by themselves
w.afr  <- c("Ghana")
s.afr <- c("South Africa")
nz <- c("New Zealand")
auz <- c("Australia")
nepal <- c("Nepal")
india <- c("India")
japan <- c("Japan")
china <- c("China")

# create a list of all the locations
locs <- list(pac.is, eur, 
         n.am, c.am, s.am, 
         w.afr, s.afr, e.afr,
         nz, auz, 
         nepal, india, se.asia,
         japan, china)

# use lapply to cycle through the whole list
# and then sapply to cycle through the items in each list. 
loc.count <- lapply(locs, function (y) sum(sapply(y, function (x) sum(stri_count(country, fixed=x)))))

# names of the aggregate locations, in the same order as locs
# (some of them are individual countries)
loc.names <- c( "Pacific Islands", "Europe", 
                "North America", "Central America", "South America",
                "West Africa", "South Africa", "East Africa", 
                "New Zealand", "Australia",
                "Nepal", "India", "South East Asia", 
                "Japan", "China")

# create a data.frame; loc.count is a list, so it needs unlist()
elec.tot.2015 <- as.data.frame(loc.names)
elec.tot.2015$count <- unlist(loc.count)  # convert the list to a vector

# calculate the radius so that the area of each circle (pi * r^2) equals the count
# smallest count = 1, which gives area = 1 and radius = sqrt(1/pi)
# add to the data.frame
elec.tot.2015$radius <- sqrt(elec.tot.2015$count/pi)

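# the list of locations was put into a batch geocoder:
# http://www.findlatitudeandlongitude.com/batch-geocode/
# copy the results to the clipboard and read them into an object called pos
# (pbpaste is the macOS clipboard command)
pos <- read.delim(pipe("pbpaste"), header = TRUE)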

elec.tot.2015 <- merge(elec.tot.2015, pos, by.x = "loc.names", by.y = "original.address")

# write the data.frame as a tsv
setwd("/Users/paulbrennan/Dropbox/Public")
write.table(elec.tot.2015, file = "electivestotal2015.tsv", sep = "\t")
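
In the script above, locs and loc.names are built separately and have to be kept in the same order. As a possible alternative, here is a sketch that uses a named list so the region names travel with the counts. The data and the shortened region list (locs.named) are made up for illustration:

```r
library(stringi)

# made-up stand-in for data$country
country <- c("Fiji", "Kenya", "Peru", "Fiji", "Brazil", "Uganda", "Kenya")

# hypothetical, shortened region list with the names attached
locs.named <- list("Pacific Islands" = c("Fiji", "Tonga"),
                   "South America"   = c("Brazil", "Peru"),
                   "East Africa"     = c("Kenya", "Uganda"))

# sapply() keeps the list names, so no separate loc.names vector is needed
loc.count <- sapply(locs.named, function(y)
  sum(sapply(y, function(x) sum(stri_count(country, fixed = x)))))

# build the data.frame directly from the named vector
elec.tot <- data.frame(loc.names = names(loc.count),
                       count = unname(loc.count))

# radius chosen so that the circle area (pi * r^2) equals the count
elec.tot$radius <- sqrt(elec.tot$count / pi)
print(elec.tot)
```

Since sapply() returns a vector rather than a list, unlist() is not needed in this version.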


I ended up opening the file in Excel, deleting the column of row numbers and changing the file extension to .txt.
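
The Excel step could probably be avoided by asking write.table() not to write the row names and by giving the file the .txt name from the start. A minimal sketch, using a small made-up data frame (df) and a temporary folder rather than my real data and Dropbox path:

```r
# small made-up data frame standing in for elec.tot.2015
df <- data.frame(loc.names = c("Europe", "Nepal"), count = c(12, 3))

# row.names = FALSE drops the column of row numbers
# quote = FALSE writes the text without quotation marks
out.file <- file.path(tempdir(), "electivestotal2015.txt")
write.table(df, file = out.file, sep = "\t", row.names = FALSE, quote = FALSE)

# read the file back in to check the columns
check <- read.delim(out.file)
print(check)
```

Whether quote = FALSE is wanted depends on what reads the file next, so treat that argument as an assumption.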
