Using wux for ensemble climate data - R

I am very new to R and I am working with New England climate data. I am currently attempting to use the wux package to find ensemble averages for each year of the mean, minimum, and maximum temperatures across all 29 climate models. In the end, for example, I want to have a raster stack of model averages, one stack for each year. The ultimate goal is to obtain a graph that shows variability. I have attempted to read through the wux PDF online, but because I am so new, and because it is such a general overview, I feel I am getting lost. I need help developing a simple framework to run the model. My script so far has a general outline of what I think I need. Correct me if I am wrong, but I think I want to be using the 'models2wux' function. Bear in mind that my script is a bit messy at this point.
#This script will:
#Calculate ensemble mean, min, and max across models within each year.
#For example, the script will find the average temperature across all models
#for the year 1980. It will do the same for all years.
#There will be a separate ensemble mean, max, and min for each scenario.
library(raster)
library(rasterVis)
library(utils)
library(wux)
library(lattice)
#wux information
#https://cran.r-project.org/web/packages/wux/wux.pdf
path <- "/net/nfs/merrimack/raid/Northeast_US_Downscaling_cmip5/"
vars <- c("tasmin", "tasmax") #, "pr")
mods <- c("ACCESS1-0", "ACCESS1-3", "bcc-csm1-1", "bcc-csm1-1-m")
#"CanESM2", "CCSM4", "CESM1-BGC", "CESM1-CAM5", "CMCC-CM",
#"CMCC-CMS", "CNRM-CM5", "CSIRO-Mk3-6-0", "FGOALS-g2", "GFDL-CM3",
#"GFDL-ESM2G", "GFDL-ESM2M", "HadGEM2-AO", "HadGEM2-CC", "HadGEM2-ES",
#"inmcm4", "IPSL-CM5A-LR", "IPSL-CM5A-MR", "MIROC5", "MIROC-ESM-CHEM",
#"MIROC-ESM", "MPI-ESM-LR", "MPI-ESM-MR", "MRI-CGCM3", "NorESM1-M")
scns <- c("rcp45", "rcp85") #, "historical")
#A character vector containing the names of the models to be processed
climate.models <- c(mods)
#csv file of important cities we want to look at (with lat/lon)
cities.path <-
  "/net/home/cv/marina/Summer2017_Projects/Lat_Lon/NE_Important_Cities.csv"
necity.vars <- c("City", "State", "Population",
                 "Latitude", "Longitude", "Elevation(meters)")
# package = wux -- models2wux
#models2wux(userinput, modelinput)
#modelinput information
#Start 4 loops to envelope netcdf data
for (iv in 1:2){
  for (im in 1:4){
    for (is in 1:2){
      for (i in 2006:2099){
        modelinput <- paste(path, vars[iv], "_day_", mods[im], "_", scns[is],
                            "_r1i1p1_", i, "0101-", i, "1231.16th.nc", sep = "")
        print(modelinput)
      } # end of year loop
    } # end of scenario loop
  } # end of model loop
} # end of variable loop
# 'full' should hold the full file name of the current netCDF file;
# here it is just the last path built by the loops above
full <- modelinput
print(full)
#more modelinput information necessary? List of models
# package = wux -- models2wux
#userinput information
parameter.names <- c("tasmin", "tasmax")
reference.period <- "2006-2099"
scenario.period <- "2006-2099"
#temporal.aggregation <-  # maybe don't need this
#subregions <-            # will identify key areas we want to look at (important cities?)
#uses projection file
#These both read the .csv file (first uses 'utils', second uses 'wux')
#1
cities.read <- read.delim(cities.path, header = TRUE, sep = ",")
#2
cities.wux <- read.wux.table(cities.path)
#cities.read <- subset(cities.read, subreg = "City")  # not sure this subset is needed here
# To read only "City", "Latitude", and "Longitude"
cities.points <- subset(cities.read, select = c(1, 4, 5))
cities.points <- as.data.frame(cities.points)
colnames(cities.points) <- c("City", "Latitude", "Longitude")
#Set plot coordinates for .csv graph
coordinates(cities.points) <- ~ Longitude + Latitude
proj4string(cities.points) <- "+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"
subregions <- proj4string(cities.points) # is this right? models2wux probably expects the subregion definition itself, not just the CRS string
#area.fraction <-  # reread pdf on this (p. 24)
#Do we want area.fraction = TRUE or FALSE? (FALSE is the default behavior)
spatial.weighting <- FALSE
#spatial.weighting = TRUE enables cosine weighting of latitudes, whereas omitting it
#or setting it to FALSE results in an unweighted arithmetic areal mean (default).
#This option is valid only for data on a regular grid.
na.rm <- FALSE #keep NA values
#plot.subregions refer to pdf pg. 25
#save.as.data saves data to specific file
# 1. use the brick function to read the full netCDF file.
# note: the varname argument is not necessary, but if a file has multiple variables, brick will read the first one by default.
air_t <- brick(full, varname = vars[iv])
# 2. use the calc function to get average, min, max for each year over the entire set of models
annualmod_ave_t <- calc(air_t, fun = mean, na.rm = T)
annualmod_max_t <- calc(air_t, fun = max, na.rm = T)
annualmod_min_t <- calc(air_t, fun = min, na.rm = T)
if (i == 2006){
  annual_ave_stack <- annualmod_ave_t
} else {
  annual_ave_stack <- stack(annual_ave_stack, annualmod_ave_t)
} # end of if/else (the max and min stacks would be built the same way)
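For context, this is roughly how I picture the brick/calc part fitting inside the year loop for a single variable, model, and scenario (just a sketch, not using wux yet; iv, im, and is stand for fixed indices from the loops above):
# Sketch: for one variable/model/scenario, compute an annual field per year
# and collect the layers in one stack per statistic (mean shown; max/min analogous)
annual_ave_stack <- stack()
for (i in 2006:2099){
  full <- paste(path, vars[iv], "_day_", mods[im], "_", scns[is],
                "_r1i1p1_", i, "0101-", i, "1231.16th.nc", sep = "")
  air_t <- brick(full, varname = vars[iv])                  # daily layers for year i
  annualmod_ave_t <- calc(air_t, fun = mean, na.rm = TRUE)  # annual mean field
  annual_ave_stack <- stack(annual_ave_stack, annualmod_ave_t)
}
names(annual_ave_stack) <- paste0("y", 2006:2099)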

Related

How can I process my StringTie data so that I can run DEseq2 using R?

I have StringTie data for a parental cell line and a KO cell line (which I'll refer to as B10). I am interested in comparing the parental and B10 cell lines. The issue seems to be that my StringTie files are separate, meaning I have one for the parental cell line and one for B10. I've included the code I have written to date for context along with the error messages I received and troubleshooting steps I have already tried. I have no idea where to go from here and I'd appreciate all the help I could get. This isn't something that anyone in my lab has done before so I'm struggling to do this without any guidance.
Thank you all in advance!
`# My code to go from StringTie to count data:
# (I copy-pasted this so all my notes are included. I'm new to R so they're really just for me; I'm not trying to condescendingly explain what every bit of the code means. You all likely know much more than I do.)
# Load the packages used below
library(readr)     # read_tsv, read_csv
library(tximport)  # tximport
library(dplyr)     # full_join
library(DESeq2)    # DESeqDataSetFromMatrix / DESeqDataSetFromTximport
# Open Data
# List StringTie output files for all samples
# All files should be in same directory
files_B10 <- list.files("C:/Users/kimbe/OneDrive/Documents/Lab/RNAseq/StringTie/data/B10", recursive = TRUE, full.names = TRUE)
files_parental <- list.files("C:/Users/kimbe/OneDrive/Documents/Lab/RNAseq/StringTie/data/parental", recursive = TRUE, full.names = TRUE)
tmp_B10 <- read_tsv(files_B10[1])
tx2gene_B10 <- tmp_B10[, c("t_name", "gene_name")]
txi_B10 <- tximport(files_B10, type = "stringtie", tx2gene = tx2gene_B10)
tmp_parental <- read_tsv(files_parental[1])
tx2gene_parental <- tmp_parental[, c("t_name", "gene_name")]
txi_parental <- tximport(files_parental, type = "stringtie", tx2gene = tx2gene_parental)
# Create a filter (vector) showing which rows have at least two columns with 5 or more counts
txi_B10.filter<-apply(txi_B10$counts,1,function(x) length(x[x>5])>=2)
txi_parental.filter<-apply(txi_parental$counts,1,function(x) length(x[x>5])>=2)
head(txi_parental.filter)
sum(txi_B10.filter)
# Now filter the txi object to keep only the rows of $counts, $abundance, and $length where the filter is TRUE
txi_B10$counts<-txi_B10$counts[txi_B10.filter,]
txi_B10$abundance<-txi_B10$abundance[txi_B10.filter,]
txi_B10$length<-txi_B10$length[txi_B10.filter,]
txi_parental$counts<-txi_parental$counts[txi_parental.filter,]
txi_parental$abundance<-txi_parental$abundance[txi_parental.filter,]
txi_parental$length<-txi_parental$length[txi_parental.filter,]
# save count data as csv files
write.csv(txi_B10$counts, "txi_B10.counts.csv")
write.csv(txi_parental$counts, "txi_parental.counts.csv")
# Open count data
# Do this in the order the files are organized in the file manager
txi_B10_counts <- read_csv("txi_B10.counts.csv")
txi_parental_counts <- read_csv("txi_parental.counts.csv")
# Set column names
colnames(txi_B10_counts) = c("Gene_name", "B10_n1", "B10_n2")
View(txi_B10_counts)
colnames(txi_parental_counts) = c("Gene_name", "parental_n1", "parental_n2")
View(txi_parental_counts)
## R is case sensitive, so you just want to ensure that everything is in the same case
## convert gene names, which are in column [[1]], to lowercase
txi_parental_counts[[1]] <- tolower( txi_parental_counts[[1]])
View(txi_parental_counts)
txi_B10_counts[[1]] <- tolower(txi_B10_counts[[1]])
View(txi_B10_counts)
## Capitalize the first letter of each gene name
capFirst <- function(s) {
paste(toupper(substring(s, 1, 1)), substring(s, 2), sep = "")
}
txi_parental_counts$Gene_name <- capFirst(txi_parental_counts$Gene_name)
View(txi_parental_counts)
# capFirst is already defined above, so just reuse it
txi_B10_counts$Gene_name <- capFirst(txi_B10_counts$Gene_name)
View(txi_B10_counts)
# Merge PL and KO into one table
# full_join takes all counts from PL and KO even if the gene names are missing
# If a value is missing it writes it as NA
# This site explains different types of merging https://remiller1450.github.io/s230s19/Merging_and_Joining.html
mergedCounts <- full_join(x = txi_parental_counts, y = txi_B10_counts, by = "Gene_name")
View(mergedCounts)
# Replace NA with value = 0
mergedCounts[is.na(mergedCounts)] = 0
View(mergedCounts)
# Save file for merged counts
write.csv(mergedCounts, "MergedCounts.csv")
## --------------------------------------------------------------------------------
# My code to go from count data to DEseq2
# Import data
# I added my metadata in case the issue is how I set up the columns
# metaData is a file with your sample names and Comparison
# Your second column in metadata must be called Comparison, otherwise you'll get an error in the dds line
metaData <- read.csv('metadata.csv', header = TRUE, sep = ",")
countData <- read.csv('MergedCounts.csv', header = TRUE, sep = ",")
# Assign "Gene Names" as row names
# Notice how there's suddenly an extra column (X)?
# write.csv saved the row names, and read.csv brought them back in as a column called X
# If you don't fix this the # of columns won't add up
rownames(countData) <- countData[,1]
countData <- countData[,-1]
# Create DEseq2 object
# !!!!!!! Here is where I get stuck!!!!!!!
dds <- DESeqDataSetFromMatrix(countData = countData,
colData = metaData,
design = ~ Comparison, tidy = TRUE)
# I can't run this line
# It says Error in DESeqDataSet(se, design = design, ignoreRank) : some values in assay are not integers
## --------------------------------------------------------------------------------
# How I tried to fix this:
# 1) I saw something here that suggested this might be an issue with having zeros in the count data
# I viewed the countData files to make sure there were no zeros and there weren't any
# I thought that would be the case since I replaced NA with value = 0 earlier using this bit of code
mergedCounts[is.na(mergedCounts)] = 0
View(mergedCounts)
# 2) I was then informed that StringTie outputs non integer values
# It was recommended that I try DESeqDataSetFromTximport instead
dds <- DESeqDataSetFromTximport(countData,
colData = metaData,
design = ~ Comparison, tidy = TRUE)
# I can't run this line either
# It says Error in DESeqDataSetFromTximport(countData, colData = metaData, design = ~Comparison, : is(txi, "list") is not TRUE
# I think this might be because merging the parental and B10 counts led to a file that's no longer a txi or accessible through Tximport
# It seems like this should be done with the original StringTie files from the very beginning of the code
# My concern with doing that is that the files for parental and B10 are separate so I don't see how I could end up comparing the two
# I think this approach would work if I was interested in comparing n1 versus n2 for each cell line, but that is not of interest to me
`
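For reference, this is roughly what I think the tximport route would look like if both cell lines are imported together, so that DESeqDataSetFromTximport receives a single txi list (the sample names and the metaData data frame below are my assumptions):
library(readr)
library(tximport)
library(DESeq2)
# Import parental and B10 samples together so tximport returns ONE txi object
files_all <- c(files_parental, files_B10)
names(files_all) <- c("parental_n1", "parental_n2", "B10_n1", "B10_n2")  # assumed sample names/order
tmp <- read_tsv(files_all[1])
tx2gene <- tmp[, c("t_name", "gene_name")]
txi_all <- tximport(files_all, type = "stringtie", tx2gene = tx2gene)
# colData: one row per sample, in the same order as files_all
metaData <- data.frame(Comparison = c("parental", "parental", "B10", "B10"),
                       row.names = names(files_all))
# DESeq2 handles the non-integer estimated counts from tximport internally
dds <- DESeqDataSetFromTximport(txi_all, colData = metaData, design = ~ Comparison)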

Problems extracting metadata from NCBI in R

I am trying to extract some information (metadata) from GenBank using the R package "rentrez" and the example I found here https://ajrominger.github.io/2018/05/21/gettingDNA.html. Specifically, for a particular group of organisms, I search for all records that have geographical coordinates and then want to extract data about the accession number, taxon, sequenced locus, country, lat_long, and collection date. As an output, I want a csv file with the data for each record in a separate row. It seems that the code below can do the job but at some point, rows get muddled with data from different records overlapping the neighbouring rows. For example, from 157 records that rentrez retrieves from NCBI 109 records in the file look like what I want to achieve but the rest is a total mess. I would greatly appreciate any advice on how to fix the issue because I am a total newbie with R and figuring out each step takes a lot of time.
setwd ("C:/R-Works")
library('XML')
library('rentrez')
argasid <- entrez_search(db="nuccore", term = "Argasidae[Organism] AND [lat]", use_history=TRUE, retmax=15000)
x <- entrez_fetch(db="nuccore", id=argasid$ids, rettype="native", retmode="xml", parsed=TRUE)
x <- xmlToList(x)
cleanEntrez <- function(x) {
  basePath <- 'Seq-entry_seq.Bioseq'
  c(
    genbank = as.character(x[paste(basePath,
                                   'Bioseq_id', 'Seq-id', 'Seq-id_genbank',
                                   'Textseq-id', 'Textseq-id_accession',
                                   sep = '.')]),
    taxon = as.character(x[paste(basePath,
                                 'Bioseq_descr', 'Seq-descr', 'Seqdesc',
                                 'Seqdesc_source', 'BioSource', 'BioSource_org',
                                 'Org-ref', 'Org-ref_taxname',
                                 sep = '.')]),
    bseqdesc_title = as.character(x[paste(basePath,
                                          'Bioseq_descr', 'Seq-descr', 'Seqdesc',
                                          'Seqdesc_title',
                                          sep = '.')]),
    lat_lon = as.character(x[grep('lat-lon', x) + 1]),
    geo_description = as.character(x[grep('country', x) + 1]),
    coll_date = as.character(x[grep('collection-date', x) + 1])
  )
}
getGenbankMeta <- function(ids) {
  allRec <- entrez_fetch(db = 'nuccore', id = ids,
                         rettype = 'native', retmode = 'xml',
                         parsed = TRUE)
  allRec <- xmlToList(allRec)[[1]]
  o <- lapply(allRec, function(x) {
    cleanEntrez(unlist(x))
  })
  temp <- array(unlist(o), dim = c(length(o[[1]]), length(ids)))
  seqVec <- temp[nrow(temp), ]
  seqDF <- as.data.frame(t(temp[-nrow(temp), ]))
  names(seqDF) <- names(o[[1]])[-nrow(temp)]
  return(list(seq = seqVec, data = seqDF))
}
write.csv(getGenbankMeta(argasid$ids), 'argasid_georef.csv')
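The muddled rows most likely come from records that are missing one of the grep-based fields (lat-lon, country, collection-date): when a field is absent, grep() returns integer(0), that element simply drops out, and the vectors for different records end up with different lengths, so the array() reshaping shifts everything. A sketch of a more defensive extractor, which always returns exactly one value (or NA) per field, could look like this (the safe_get helper is mine; the paths are the same as above):
# Hypothetical helper: return exactly one value per field, NA if the field is absent,
# so every record yields a vector of the same length
safe_get <- function(x, idx) {
  val <- x[idx]
  if (length(val) == 0 || all(is.na(val))) NA_character_ else as.character(val[1])
}
cleanEntrezSafe <- function(x) {
  basePath <- 'Seq-entry_seq.Bioseq'
  c(
    genbank = safe_get(x, paste(basePath, 'Bioseq_id', 'Seq-id', 'Seq-id_genbank',
                                'Textseq-id', 'Textseq-id_accession', sep = '.')),
    taxon = safe_get(x, paste(basePath, 'Bioseq_descr', 'Seq-descr', 'Seqdesc',
                              'Seqdesc_source', 'BioSource', 'BioSource_org',
                              'Org-ref', 'Org-ref_taxname', sep = '.')),
    lat_lon = safe_get(x, grep('lat-lon', x, fixed = TRUE) + 1),
    geo_description = safe_get(x, grep('country', x, fixed = TRUE) + 1),
    coll_date = safe_get(x, grep('collection-date', x, fixed = TRUE) + 1)
  )
}
The per-record results could then be combined with do.call(rbind, lapply(allRec, function(x) cleanEntrezSafe(unlist(x)))) instead of relying on the fixed-size array() reshaping.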

In R, calculate a dataframe as efficiently and as fast as possible from thousands of external files

I am building a Shiny application in which a large ggplot2 fortified dataframe needs to be calculated over and over again, using a large amount of external source files. I am searching for the fastest and most efficient way to do this. In the following paragraph I will delve a little bit more into the subject and the code I have so far and also provide the input data to enable your kind assistance.
I am using the Helsinki Region Travel Time Matrix 2018, a dataset provided by Digital Geography Lab, a research group in the University of Helsinki. This data uses a generalised map of Helsinki capital region, in 250 x 250 meter cells (in my code grid_f), to calculate travel times between all cells in the map (grid ids are called YKR_ID, n=13231) by public transport, private car, bicycle and by foot. The calculations are stored in delimited .txt files, one text file for all the travel times to a specific cell id. The data is available for download at this website, under "Download the data". NB, the unzipped data is 13.8 GB in size.
Here is a selection from a text file in the dataset:
from_id;to_id;walk_t;walk_d;bike_s_t;bike_f_t;bike_d;pt_r_tt;pt_r_t;pt_r_d;pt_m_tt;pt_m_t;pt_m_d;car_r_t;car_r_d;car_m_t;car_m_d;car_sl_t
5785640;5785640;0;0;-1;-1;-1;0;0;0;0;0;0;-1;0;-1;0;-1
5785641;5785640;48;3353;51;32;11590;48;48;3353;48;48;3353;22;985;21;985;16
5785642;5785640;50;3471;51;32;11590;50;50;3471;50;50;3471;22;12167;21;12167;16
5785643;5785640;54;3764;41;26;9333;54;54;3764;54;54;3764;22;10372;21;10370;16
5787544;5785640;38;2658;10;7;1758;38;38;2658;38;38;2658;7;2183;7;2183;6
My interest is to visualise (with ggplot2) this 250x250m Helsinki region map for one travel mode, the private car, using any of the possible 13231 cell ids, repeatedly if the user wants. Because of this it is important that the dataframe fetch is as fast and efficient as possible. For this question, let's concentrate on the fetching and processing of the data from the external files and use only one specific id value.
In a nutshell, after I have produced a ggplot2::fortify() version of the 250 x 250 meter grid spatial dataset grid_f, I need to:
scan through all 13231 Travel Time Matrix 2018 text files,
pick only the relevant columns (from_id, to_id, car_r_t, car_m_t, car_sl_t) in each file,
pick the relevant row using from_id (in this case, origin_id <- "5985086") in each file,
and join the resulting row to the fortified spatial data grid_f.
My code is as follows:
# Libraries
library(ggplot2)
library(dplyr)
library(rgdal)
library(data.table)
library(sf)
library(sp)
# File paths. ttm_path is the folder which contains the unchanged Travel
# Time Matrix 2018 data from the research group's home page
ttm_path <- "HelsinkiTravelTimeMatrix2018"
gridpath <- "MetropAccess_YKR_grid_EurefFIN.shp"
#### Import grid cells
# use this CRS information throughout the app
app_crs <- sp::CRS("+init=epsg:3067")
# Read grid shapefile and transform
grid_f <- rgdal::readOGR(gridpath, stringsAsFactors = TRUE) %>%
  sp::spTransform(., app_crs) %>%
  # preserve grid dataframe data in the fortify
  {dplyr::left_join(ggplot2::fortify(.),
                    as.data.frame(.) %>%
                      dplyr::mutate(id = as.character(dplyr::row_number() - 1)))} %>%
  dplyr::select(-c(x, y))
The code above this point is meant to run only once. The code below, more or less, would be run over and over with different origin_ids.
#### Fetch TTM18 data
origin_id <- "5985086"
origin_id_num <- as.numeric(origin_id)
# column positions of columns from_id, to_id, car_r_t, car_m_t, car_sl_t
col_range <- c(1, 2, 14, 16, 18)
# grid_f as data.table version
dt_grid <- as.data.table(grid_f)
# Get filepaths of all of the TTM18 data. Remove metadata textfile filepath.
all_files <- list.files(path = ttm_path,
                        pattern = ".txt$",
                        recursive = TRUE,
                        full.names = TRUE)
all_files <- all_files[-length(all_files)]
# lapply function
TTM18_fetch <- function(x, col_range, origin_id) {
  res <- fread(x, select = col_range)
  res <- subset(res, from_id == origin_id)
  return(res)
}
# The part of the code that needs to be fast and efficient
result <-
  lapply(all_files, FUN = TTM18_fetch, col_range, origin_id_num) %>%
  data.table::rbindlist(., fill = TRUE) %>%
  data.table::merge.data.table(dt_grid, ., by.x = "YKR_ID", by.y = "to_id")
The dataframe result should have 66155 rows of 12 variables, five rows for each 250x250 meter grid cell. The columns are YKR_ID, long, lat, order, hole, piece, id, group, from_id, car_r_t, car_m_t, car_sl_t.
My current lapply() and data.table::fread() solution takes about 2-3 minutes to complete. I think this is already a good achievement, but I can't help and think there are better and faster ways to complete this. So far, I have tried these alternatives to what I now have:
A conventional for loop: that was obviously a slow solution
I tried to teach myself more about vectorised functions in R, but that did not lead anywhere. Used this link
Tried to dabble with with() unsuccessfully using this SO question, inspired by this SO question
Looked into package parallel but ended up not utilising that because of the Windows environment I am using
Tried to find alternative ways to solve this with apply() and sapply() but nothing noteworthy came out of that.
As to why I didn't do all this to the data before ggplot2::fortify, I simply found it troublesome to work with a SpatialPolygonsDataFrame.
Thank you for your time.
Whenever I'm trying to figure out how to improve the performance of my R functions, I generally use the following approach. First, I look for any function calls that may be unnecessary, or identify places where multiple function calls can be simplified into one. Then, I look for the places in my code that incur the greatest time penalty by benchmarking each part separately. This can easily be done using the microbenchmark package.
For example, we can ask whether we get better performance with or without piping (e.g. %>%).
# hint... piping is always slower
library(magrittr)
library(microbenchmark)
microbenchmark(
pipe = iris %>% subset(Species=='setosa'),
no_pipe = subset(iris, Species=='setosa'),
times = 200)
Unit: microseconds
    expr     min      lq     mean   median       uq      max neval cld
    pipe 157.518 196.739 308.1328 229.6775 312.6565 2473.582   200   b
 no_pipe  84.894 116.386 145.4039 126.1950 139.4100  612.492   200  a
Here, we find that subsetting a data.frame without piping takes nearly half the time to execute!
Next, I determine the net time penalty for each place I benchmarked by multiplying the execution time by the total number of times it needs to be executed. For the areas with the greatest net time penalty, I try to replace them with faster functions and/or reduce the total number of times they need to be executed.
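To make that concrete with the objects from the question (a rough sketch; timings are machine-dependent, and all_files / col_range are as defined above):
library(data.table)
library(microbenchmark)
# Benchmark a single fread() call with only the needed columns, then scale the
# median time by the number of files to estimate the total time spent reading
bm <- microbenchmark(
  one_read = fread(all_files[1], select = col_range),
  times = 20
)
# microbenchmark reports nanoseconds; the result below is a projected total in seconds
median(bm$time) * length(all_files) / 1e9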
TLDR
In your case, you can speed things up by using the fst package, although you would need to convert your delimited text files to fst files first.
# before
TTM18_fetch <- function(x, col_range, origin_id) {
  res <- data.table::fread(x, select = col_range)
  res <- subset(res, from_id == origin_id)
  return(res)
}
# after (NB: x needs to be an fst file)
col_range <- c('from_id', 'to_id', 'car_r_t', 'car_m_t', 'car_sl_t')
TTM18_fetch <- function(x, col_range, origin_id) {
  res <- fst::read_fst(path = x,
                       columns = col_range,
                       as.data.table = TRUE)[from_id == origin_id]
  return(res)
}
To convert the text files to fst:
library(data.table)
library(fst)
ttm_path <- 'REPLACE THIS'
new_ttm_path <- 'REPLACE THIS'
# Get filepaths of all of the TTM18 data. Remove the metadata textfile filepath.
all_files <- list.files(path = ttm_path,
                        pattern = ".txt$",
                        recursive = TRUE,
                        full.names = TRUE)
all_files <- all_files[!grepl('[Mm]eta', all_files)]
# creating new file paths and names for the fst files
file_names <- list.files(path = ttm_path,
                         pattern = ".txt$",
                         recursive = TRUE)
file_names <- file_names[!grepl('[Mm]eta', file_names)]
file_names <- gsub(pattern = '\\.txt$',
                   replacement = '.fst',
                   x = file_names)
file_names <- file.path(new_ttm_path, file_names)
# txt to fst conversion
require(progress) # this will help you keep track of progress
pb <- progress_bar$new(
  format = " :what [:bar] :percent eta: :eta",
  clear = FALSE, total = length(file_names), width = 60)
# an index file to store from_id file locations
from_id_paths <- data.table(from_id = numeric(),
                            file_path = character())
for (i in seq_along(file_names)) {
  pb$tick(tokens = list(what = 'reading'))
  tmp <- data.table::fread(all_files[i], key = 'from_id')
  pb$tick(0, tokens = list(what = 'writing'))
  fst::write_fst(tmp,
                 compress = 50, # less compressed files read faster
                 path = file_names[i])
  pb$tick(0, tokens = list(what = 'indexing'))
  from_id_paths <- rbind(from_id_paths,
                         data.table(from_id = unique(tmp$from_id),
                                    file_path = file_names[i]))
}
setkey(from_id_paths, from_id)
write_fst(from_id_paths,
          path = file.path(new_ttm_path, 'from_id_index.fst'),
          compress = 0)
This would be the replacement
library(fst)
library(data.table)
new_ttm_path <- 'REPLACE THIS'
#### Fetch TTM18 data
origin_id <- "5985086"
origin_id_num <- as.numeric(origin_id)
# names of the columns from_id, to_id, car_r_t, car_m_t, car_sl_t
col_range <- c('from_id', 'to_id', 'car_r_t', 'car_m_t', 'car_sl_t')
# grid_f as data.table version
dt_grid <- as.data.table(grid_f)
# look up which fst files actually contain the origin_id
necessary_files <- read_fst(path = file.path(new_ttm_path,
                                             'from_id_index.fst'),
                            as.data.table = TRUE
                            )[from_id == origin_id, file_path]
TTM18_fetch <- function(x, col_range, origin_id) {
  res <- fst::read_fst(path = x,
                       columns = col_range,
                       as.data.table = TRUE)[from_id == origin_id]
  return(res)
}
result <- rbindlist(lapply(necessary_files, FUN = TTM18_fetch, col_range, origin_id_num),
                    fill = TRUE)
result <- data.table::merge.data.table(dt_grid, result, by.x = "YKR_ID", by.y = "to_id")

Using a loop to apply gmapsdistance to a list in R

I am trying to use the gmapsdistance package in R to calculate the journey time by public transport between a list of postcodes (origin) and a single destination postcode.
The output for a single query is:
$Time
[1] 5352
$Distance
[1] 34289
$Status
[1] "OK"
I actually have 2.5k postcodes to use but whilst I troubleshoot it I have set the iterations to 10. london1 is a dataframe containing a single column with 2500 postcodes in 2500 rows.
This is my attempt so far;
results <- for(i in 1:10) {
gmapsdistance::set.api.key("xxxxxx")
gmapsdistance::gmapsdistance(origin = "london1[i]"
destination = "WC1E 6BT"
mode = "transit"
dep_date = "2017-04-18"
dep_time = "09:00:00")}
When I run this loop I get
results <- for(i in 1:10) {
+ gmapsdistance::set.api.key("xxxxxx")
+ gmapsdistance::gmapsdistance(origin = "london1[i]"
+ destination = "WC1E 6BT"
Error: unexpected symbol in:
" gmapsdistance::gmapsdistance(origin = "london1[i]"
destination"
mode = "transit"
dep_date = "2017-04-18"
dep_time = "09:00:00")}
Error: unexpected ')' in " dep_time = "09:00:00")"
My questions are:
1) How can I fix this?
2) How do I need to format this, so the output is a dataframe or matrix containing the origin postcode and journey time?
Thanks
There are a few things going on here:
"london[i]" needs to be london[i, 1]
you need to separate your arguments with commas ,
I get an error when using, e.g., "WC1E 6BT", I found it necessary to replace the space with a dash, like "WC1E-6BT"
the loop needs to explicitly assign values to elements of results
So your code would look something like:
library(gmapsdistance)
## some example data
london1 <- data.frame(postCode = c('WC1E-7HJ', 'WC1E-6HX', 'WC1E-7HY'))
## make an empty list to be filled in
results <- vector('list', 3)
for (i in 1:3) {
  set.api.key("xxxxxx")
  ## fill in your results list
  results[[i]] <- gmapsdistance(origin = london1[i, 1],
                                destination = "WC1E-6BT",
                                mode = "transit",
                                dep_date = "2017-04-18",
                                dep_time = "09:00:00")
}
It turns out you don't need a loop---and probably shouldn't---when using gmapsdistance (see the help doc) and the output from multiple inputs also helps in quickly formatting your output into a data.frame:
set.api.key("xxxxxx")
temp1 <- gmapsdistance(origin = london1[, 1],
destination = "WC1E-6BT",
mode = "transit",
dep_date = "2017-04-18",
dep_time = "09:00:00",
combinations = "all")
The above returns a list of data.frame objects, one each for Time, Distance and Status. You can then easily make those into a data.frame containing everything you might want:
res <- data.frame(origin = london1[, 1],
desination = 'WC1E-6BT',
do.call(data.frame, lapply(temp1, function(x) x[, 2])))
lapply(temp1, function(x) x[, 2]) extracts the needed column from each data.frame in the list, and do.call puts them back together as columns in a new data.frame object.

Looping over an API with a varying structure using R

I am using Graphite (http://graphite.wikidot.com/) to log performance statistics for various services, which we can access via an API. Each service has a few different metrics, and each metric has a few different statistics. To loop over all of them to grab the stats we want, I've written 3 nested for loops as shown below to create the necessary URL. And then it gets worse. We just introduced another level to this hierarchy because there can be more than one of each service, so they each need a unique ID. Before making this even messier, I am convinced there must be an easier way, but Googling hasn't turned up anything. Any ideas on the best way to approach it?
dir.current <- getwd()
dir.create(file.path(dir.current, "All Data"), showWarnings = FALSE)
dir.create(file.path(dir.current, "Charts"), showWarnings = FALSE)
# Set the grab parameters
graphite.ip <- "192.168.0.16:8080"
from <- list(hour="18", min="00", year="2013", month="09", day="18")
until <- list(hour="10", min="50", year="2013", month="09", day="19")
test.name <- "multinode"
# Builds the ugly parts of the URL.
graphite.ip <- paste("http://", graphite.ip, "/render?", sep="")
from <- paste("from=", from$hour, "%3A", from$min, "_", from$year, from$month, from$day, sep="")
until <- paste("&until=", until$hour, "%3A", until$min, "_", until$year, until$month, until$day, sep="")
test.name <- paste("&target=", test.name, sep="")
# A few variables for common statistics used.
stats.few <- c("count", "m1_rate", "m5_rate", "m15_rate", "mean_rate")
stats.many <- c("count", "m1_rate", "mean", "mean_rate", "p95", "stddev")
stats.memory <- c("total.used")
# Specify which metrics to grab for which services
engine.stats <- list("event-timer"=stats.many, "memory"=stats.memory)
journaler.stats <- list("journaler-rate"=stats.few, "memory"=stats.memory)
notification.stats <- list("notification-rate"=stats.few, "memory"=stats.memory, "reaction-tenant-one-PT4-time"=stats.many)
eventsin.stats <- list("Incoming"=stats.few, "memory"=stats.memory)
broker.stats <- list("memory"=stats.memory, "events"=stats.few)
# Specify which services you're interested in (should be above as well)
services <- list("engine"=engine.stats, "notification"=notification.stats, "rest"=eventsin.stats, "broker"=broker.stats)
merge.count <- 1
# Loops over everything above to grab the CSVs
for (service in names(services)) {
  for (metric in names(services[[service]])) {
    for (stat in services[[service]][[metric]]) {
      target <- paste(test.name, service, metric, stat, sep=".")
      data.name <- paste(service, metric, stat, sep=".")
      print(data.name) # Visual indicator
      # Download the graphs
      url.png <- paste(graphite.ip, from, until, target, "&width=800&height=600", "&format=png", sep="")
      setwd(file.path(dir.current, "Charts"))
      download.file(url.png, paste(data.name, ".png", sep=""), quiet=TRUE)
      # Download, clean and merge CSVs
      url.csv <- paste(graphite.ip, from, until, target, "&format=csv", sep="")
      data <- read.csv(url.csv, col.names = c("Data Name", "Date", data.name), header=FALSE)
      data[1] <- NULL # Cleans up the data
      # If a column has integers larger than 2^31, rewrite the data in millions.
      if (sapply(data[2], max, na.rm=TRUE) >= 2^31) {
        data[2] = data[2]/10^6
      }
      if (merge.count == 1) {
        data.merged <- data
        merge.count = merge.count + 1
      } else {
        data.merged = cbind(data.merged, data[2])
      }
      csv.name <- paste(service, metric, stat, "csv", sep=".")
      setwd(file.path(dir.current, "All Data"))
      write.csv(data, csv.name, row.names=FALSE)
    }
  }
}
setwd(file.path(dir.current))
write.csv(data.merged, "MergedData.csv", row.names=FALSE)
# Print summary of all statistics
# print(summary(data.merged))
# Print a mean and sd of all the columns
print("Column Means:")
print(colMeans(data.merged[,-1], na.rm=TRUE))
print("Column Standard Deviations:")
print(sapply(data.merged[,-1], sd, na.rm=TRUE))
print("Download and merging complete.")
Wildcards! The Graphite URL API supports glob-style wildcards in target expressions, so you can query whole branches of the metric tree at once.
If I have the following:
stats.A.A
stats.A.B
stats.A.C
stats.B.A.1
stats.B.A.2
stats.B.A.3
stats.C.B.C.D.1
stats.C.B.C.D.2
stats.C.B.C.D.3
stats.C.B.C.D.4
Then group(stats.*.*, stats.*.*.*, stats.*.*.*.*) will resolve to all of them. Another interesting function is groupByNode.
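As a rough sketch of how that could collapse the nested loops in the script above (the three-level pattern after the test name is an assumption about how the metrics are laid out; adjust it to the actual tree):
# Hypothetical: grab every <service>.<metric>.<stat> series under the test name
# in one request by putting glob wildcards in the target
url.csv.all <- paste(graphite.ip, from, until, test.name, ".*.*.*",
                     "&format=csv", sep = "")
# Graphite's CSV output has one row per series per timestamp: series, datetime, value
all.data <- read.csv(url.csv.all,
                     col.names = c("Series", "Date", "Value"),
                     header = FALSE)
# Pivot to one column per series, keyed by timestamp
data.merged <- reshape(all.data, idvar = "Date", timevar = "Series",
                       direction = "wide")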
I think an issue with this is that it's a big loop that keeps cbind()ing data. A better approach would be to write a function that contains all the code within the inner loop and that takes service, metric, and stat as parameters. Let's call this function "process.stat". It returns data, or whatever you wanted to cbind.
First, you need to extract the service/metric/stat tuples:
# One column (service)
mat1 <- data.frame(service = names(services))
# List (one entry per service name) of service/metric pairs
list1 <- apply(mat1, 1, function(service) expand.grid(service = service,
                                                      metric = names(services[[service]])))
# Two columns (service and metric)
mat2 <- do.call(rbind, list1)
# List (one entry per service/metric pair) of service/metric/stat tuples
list2 <- apply(mat2, 1, function(x) expand.grid(service = x[1], metric = x[2],
                                                stat = services[[x[1]]][[x[2]]]))
# Three columns (service, metric, and stat)
tuples <- do.call(rbind, list2)
Then you would use something from the apply family to call process.stat on every combination of service/metric/stat that you want handled:
data.merged <- apply(tuples, 1, process.stat)
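For completeness, process.stat could be something along these lines, essentially the body of the original inner loop with the cbind bookkeeping removed (it reuses the graphite.ip, from, until, test.name and dir.current variables defined in the question):
# Sketch of process.stat: takes one row of `tuples`, downloads the chart and CSV,
# and returns the cleaned two-column data.frame (Date plus the value column)
process.stat <- function(tuple) {
  service <- tuple[["service"]]
  metric  <- tuple[["metric"]]
  stat    <- tuple[["stat"]]
  target    <- paste(test.name, service, metric, stat, sep=".")
  data.name <- paste(service, metric, stat, sep=".")
  # Download the graph
  url.png <- paste(graphite.ip, from, until, target, "&width=800&height=600", "&format=png", sep="")
  download.file(url.png, file.path(dir.current, "Charts", paste(data.name, ".png", sep="")), quiet=TRUE)
  # Download and clean the CSV
  url.csv <- paste(graphite.ip, from, until, target, "&format=csv", sep="")
  data <- read.csv(url.csv, col.names = c("Data Name", "Date", data.name), header=FALSE)
  data[1] <- NULL
  # If a column has values larger than 2^31, rewrite the data in millions
  if (sapply(data[2], max, na.rm=TRUE) >= 2^31) {
    data[2] <- data[2]/10^6
  }
  write.csv(data, file.path(dir.current, "All Data", paste(data.name, "csv", sep=".")), row.names=FALSE)
  data
}
The merged table can then be built once at the end, for example with Reduce(function(a, b) cbind(a, b[2]), data.merged), instead of growing it inside the loop.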
