I am using Graphite (http://graphite.wikidot.com/) to log performance statistics for various services, which we can access via an API. Each service has a few different metrics, and each metric has a few different statistics. To loop over all of them to grab the stats we want, I've written 3 nested for loops as shown below to create the necessary URL. And then it gets worse. We just introduced another level to this hierarchy because there can be more than one of each service, so they each need a unique ID. Before making this even messier, I am convinced there must be an easier way, but Googling hasn't turned up anything. Any ideas on the best way to approach it?
dir.current <- getwd()
dir.create(file.path(dir.current, "All Data"), showWarnings = FALSE)
dir.create(file.path(dir.current, "Charts"), showWarnings = FALSE)
# Set the grab parameters
graphite.ip <- "192.168.0.16:8080"
from <- list(hour="18", min="00", year="2013", month="09", day="18")
until <- list(hour="10", min="50", year="2013", month="09", day="19")
test.name <- "multinode"
# Builds the ugly parts of the URL.
graphite.ip <- paste("http://", graphite.ip, "/render?", sep="")
from <- paste("from=", from$hour, "%3A", from$min, "_", from$year, from$month, from$day, sep="")
until <- paste("&until=", until$hour, "%3A", until$min, "_", until$year, until$month, until$day, sep="")
test.name <- paste("&target=", test.name, sep="")
# A few variables for common statistics used.
stats.few <- c("count", "m1_rate", "m5_rate", "m15_rate", "mean_rate")
stats.many <- c("count", "m1_rate", "mean", "mean_rate", "p95", "stddev")
stats.memory <- c("total.used")
# Specify which metrics to grab for which services
engine.stats <- list("event-timer"=stats.many, "memory"=stats.memory)
journaler.stats <- list("journaler-rate"=stats.few, "memory"=stats.memory)
notification.stats <- list("notification-rate"=stats.few, "memory"=stats.memory, "reaction-tenant-one-PT4-time"=stats.many)
eventsin.stats <- list("Incoming"=stats.few, "memory"=stats.memory)
broker.stats <- list("memory"=stats.memory, "events"=stats.few)
# Specify which services you're interested in (should be above as well)
services <- list("engine"=engine.stats, "notification"=notification.stats, "rest"=eventsin.stats, "broker"=broker.stats)
merge.count <- 1
# Loops over everything above to grab the CSVs
for (service in names(services)) {
  for (metric in names(services[[service]])) {
    for (stat in services[[service]][[metric]]) {
      target <- paste(test.name, service, metric, stat, sep=".")
      data.name <- paste(service, metric, stat, sep=".")
      print(data.name) # Visual indicator
      # Download the graphs
      url.png <- paste(graphite.ip, from, until, target, "&width=800&height=600", "&format=png", sep="")
      setwd(file.path(dir.current, "Charts"))
      download.file(url.png, paste(data.name, ".png", sep=""), quiet=TRUE)
      # Download, clean and merge CSVs
      url.csv <- paste(graphite.ip, from, until, target, "&format=csv", sep="")
      data <- read.csv(url.csv, col.names = c("Data Name", "Date", data.name), header=FALSE)
      data[1] <- NULL # Cleans up the data
      # If a column has integers larger than 2^31, rewrite the data in millions.
      if (sapply(data[2], max, na.rm=TRUE) >= 2^31) {
        data[2] <- data[2]/10^6
      }
      if (merge.count == 1) {
        data.merged <- data
        merge.count <- merge.count + 1
      } else {
        data.merged <- cbind(data.merged, data[2])
      }
      csv.name <- paste(service, metric, stat, "csv", sep=".")
      setwd(file.path(dir.current, "All Data"))
      write.csv(data, csv.name, row.names=FALSE)
    }
  }
}
setwd(file.path(dir.current))
write.csv(data.merged, "MergedData.csv", row.names=FALSE)
# Print summary of all statistics
# print(summary(data.merged))
# Print a mean and sd of all the columns
print("Column Means:")
print(colMeans(data.merged[,-1], na.rm=TRUE))
print("Column Standard Deviations:")
print(sapply(data.merged[,-1], sd, na.rm=TRUE))
print("Download and merging complete.")
Wildcards! The Graphite URL API supports glob-style wildcards that allow you to query the metric tree without spelling out every path.
If I have the following:
stats.A.A
stats.A.B
stats.A.C
stats.B.A.1
stats.B.A.2
stats.B.A.3
stats.C.B.C.D.1
stats.C.B.C.D.2
stats.C.B.C.D.3
stats.C.B.C.D.4
Then group(stats.*.*,stats.*.*.*,stats.*.*.*.*) will resolve into all of them. Another interesting function is groupByNode.
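To make that concrete for this question's layout, here is a rough sketch of how a single wildcard target could stand in for the three nested loops when fetching CSV data. It reuses the graphite.ip, from, until and test.name pieces built in the question, and it assumes the metrics live under <test.name>.<service>.<metric>.<stat>:
# A rough sketch, assuming the question's graphite.ip, from, until and test.name
# objects (test.name already contains "&target=", as in the question's code).
# One wildcard target covers every <service>.<metric>.<stat> at this depth; stats
# that sit one level deeper (e.g. memory.total.used) can be requested with a
# second &target= parameter or wrapped in group() as described above.
target.all <- paste(test.name, ".*.*.*", sep = "")
url.csv <- paste(graphite.ip, from, until, target.all, "&format=csv", sep = "")
# Graphite's CSV output has one row per (metric path, timestamp) pair
all.data <- read.csv(url.csv, header = FALSE,
                     col.names = c("Metric.Path", "Date", "Value"))
# Split back into one series per metric path if needed
per.metric <- split(all.data[, c("Date", "Value")], all.data$Metric.Path)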
I think an issue with this is that it's a big loop that keeps cbind()ing data. A better approach would be to write a function that contains all the code within the inner loop and that takes service, metric, and stat as parameters. Let's call this function "process.stat". It returns data, or whatever you wanted to cbind.
First, you need to extract the service/metric/stat tuples:
# One column (service)
mat1 <- data.frame(service=names(services))
# List (one entry per service name) of service/metric pairs
list1 <- apply(mat1, 1, function(service) expand.grid(service=service, metric=names(services[[service]])))
# Two columns (service and metric)
mat2 <- do.call(rbind, list1)
# List (one entry per service/metric pair) of service/metric/stat tuples
list2 <- apply(mat2, 1, function(x) expand.grid(service=x[1], metric=x[2], stat=services[[x[1]]][[x[2]]]))
# Three columns (service, metric, and stat)
tuples <- do.call(rbind, list2)
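For reference, here is a minimal sketch of what process.stat could look like, reusing the graphite.ip, from, until and test.name objects from the question and mirroring the cleaning steps of the original inner loop (CSV download only; the PNG part would be analogous):
# Minimal sketch, assuming the question's graphite.ip, from, until and test.name
# objects have already been built as in the original script.
process.stat <- function(service, metric, stat) {
  target <- paste(test.name, service, metric, stat, sep = ".")
  url.csv <- paste(graphite.ip, from, until, target, "&format=csv", sep = "")
  data <- read.csv(url.csv, header = FALSE,
                   col.names = c("Data.Name", "Date", "Value"))
  values <- data$Value
  if (max(values, na.rm = TRUE) >= 2^31) {
    values <- values / 10^6 # rescale very large counters, as in the original loop
  }
  values
}
If every series comes back with the same number of timestamps, apply() will column-bind these vectors into a matrix for you, and the column names can be set afterwards, e.g. with something like do.call(paste, c(tuples, sep = ".")).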
Then you would use something from the apply family to call process.stat on every combination of service/metric/stat that you want handled:
data.merged <- apply(tuples, 1, function(x) process.stat(x["service"], x["metric"], x["stat"]))
I am building a Shiny application in which a large ggplot2 fortified dataframe needs to be calculated over and over again, using a large amount of external source files. I am searching for the fastest and most efficient way to do this. In the following paragraph I will delve a little bit more into the subject and the code I have so far and also provide the input data to enable your kind assistance.
I am using the Helsinki Region Travel Time Matrix 2018, a dataset provided by the Digital Geography Lab, a research group at the University of Helsinki. This data uses a generalised map of the Helsinki capital region, in 250 x 250 meter cells (in my code grid_f), to calculate travel times between all cells in the map (grid ids are called YKR_ID, n=13231) by public transport, private car, bicycle and on foot. The calculations are stored in delimited .txt files, one text file for all the travel times to a specific cell id. The data is available for download at this website, under "Download the data". NB, the unzipped data is 13.8 GB in size.
Here is a selection from a text file in the dataset:
from_id;to_id;walk_t;walk_d;bike_s_t;bike_f_t;bike_d;pt_r_tt;pt_r_t;pt_r_d;pt_m_tt;pt_m_t;pt_m_d;car_r_t;car_r_d;car_m_t;car_m_d;car_sl_t
5785640;5785640;0;0;-1;-1;-1;0;0;0;0;0;0;-1;0;-1;0;-1
5785641;5785640;48;3353;51;32;11590;48;48;3353;48;48;3353;22;985;21;985;16
5785642;5785640;50;3471;51;32;11590;50;50;3471;50;50;3471;22;12167;21;12167;16
5785643;5785640;54;3764;41;26;9333;54;54;3764;54;54;3764;22;10372;21;10370;16
5787544;5785640;38;2658;10;7;1758;38;38;2658;38;38;2658;7;2183;7;2183;6
My interest is to visualise (with ggplot2) this 250x250m Helsinki region map for one travel mode, the private car, using any of the possible 13231 cell ids, repeatedly if the user wants. Because of this it is important that the dataframe fetch is as fast and efficient as possible. For this question, let's concentrate on the fetching and processing of the data from the external files and use only one specific id value.
In a nutshell, after I have produced a ggplot2::fortify() version of the 250 x 250 meter grid spatial dataset grid_f,
I need to scan through all the 13231 Travel Time Matrix 2018 text files
Pick only the relevant columns (from_id, to_id, car_r_t, car_m_t, car_sl_t) in each file
Pick the relevant row using from_id (in this case, origin_id <- "5985086") in each file
Join the resulting row to the fortified spatial data grid_f
My code is as follows:
# Libraries
library(ggplot2)
library(dplyr)
library(rgdal)
library(data.table)
library(sf)
library(sp)
# File paths. ttm_path is the folder which contains the unchanged Travel
# Time Matrix 2018 data from the research group's home page
ttm_path <- "HelsinkiTravelTimeMatrix2018"
gridpath <- "MetropAccess_YKR_grid_EurefFIN.shp"
#### Import grid cells
# use this CRS information throughout the app
app_crs <- sp::CRS("+init=epsg:3067")
# Read grid shapefile and transform
grid_f <- rgdal::readOGR(gridpath, stringsAsFactors = TRUE) %>%
sp::spTransform(., app_crs) %>%
# preserve grid dataframe data in the fortify
{dplyr::left_join(ggplot2::fortify(.),
as.data.frame(.) %>%
dplyr::mutate(id = as.character(dplyr::row_number() - 1)))} %>%
dplyr::select(-c(x, y))
The code above this point is meant to run only once. The code below, more or less, would be run over and over with different origin_ids.
#### Fetch TTM18 data
origin_id <- "5985086"
origin_id_num <- as.numeric(origin_id)
# column positions of columns from_id, to_id, car_r_t, car_m_t, car_sl_t
col_range <- c(1, 2, 14, 16, 18)
# grid_f as data.table version
dt_grid <- as.data.table(grid_f)
# Get filepaths of all of the TTM18 data. Remove metadata textfile filepath.
all_files <- list.files(path = ttm_path,
pattern = ".txt$",
recursive = TRUE,
full.names = TRUE)
all_files <- all_files[-length(all_files)]
# lapply function
TTM18_fetch <- function(x, col_range, origin_id) {
res <- fread(x, select = col_range)
res <- subset(res, from_id == origin_id)
return(res)
}
# The part of the code that needs to be fast and efficient
result <-
lapply(all_files, FUN = TTM18_fetch, col_range, origin_id_num) %>%
data.table::rbindlist(., fill = TRUE) %>%
data.table::merge.data.table(dt_grid, ., by.x = "YKR_ID", by.y = "to_id")
The dataframe result should have 66155 rows of 12 variables, five rows for each 250x250 meter grid cell. The columns are YKR_ID, long, lat, order, hole, piece, id, group, from_id, car_r_t, car_m_t, car_sl_t.
My current lapply() and data.table::fread() solution takes about 2-3 minutes to complete. I think this is already a good achievement, but I can't help but think there are better and faster ways to complete this. So far, I have tried these alternatives to what I now have:
A conventional for loop: that was obviously a slow solution
I tried to teach myself more about vectorised functions in R, but that did not lead anywhere. Used this link
Tried to dabble with with() unsuccessfully using this SO question, inspired by this SO question
Looked into package parallel but ended up not utilising that because of the Windows environment I am using
Tried to find alternative ways to solve this with apply() and sapply() but nothing noteworthy came out of that.
As to why I didn't do all this to the data before ggplot2::fortify, I simply found it troublesome to work with a SpatialPolygonsDataFrame.
Thank you for your time.
Whenever I'm trying to figure out how to improve the performance of my R functions, I generally use the following approach. First, I look for any function calls that may be unnecessary, or identify places where multiple function calls can be simplified into one. Then, I look for the places in my code that incur the greatest time penalty by benchmarking each part separately. This can easily be done using the microbenchmark package.
For example, we can ask if we get better performance with or without
piping (e.g. %>%).
# hint... piping adds a small overhead here
library(magrittr)
library(microbenchmark)
microbenchmark(
pipe = iris %>% subset(Species=='setosa'),
no_pipe = subset(iris, Species=='setosa'),
times = 200)
Unit: microseconds
expr min lq mean median uq max neval cld
pipe 157.518 196.739 308.1328 229.6775 312.6565 2473.582 200 b
no_pipe 84.894 116.386 145.4039 126.1950 139.4100 612.492 200 a
Here, we find that subsetting the data.frame without piping takes nearly half the time to execute!
Next, I determine the net time penalty for each place I benchmarked by multiplying the execution time by the total number of times it needs to be executed. For the areas with the greatest net time penalty, I try to replace them with faster functions and/or try to reduce the total number of times they need to be executed.
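As a rough back-of-the-envelope illustration of that multiplication, using the medians from the benchmark above and the 13231 files mentioned in the question (the totals are only indicative):
# Per-call median (seconds) times the number of calls; medians come from the
# microbenchmark output above, 13231 is the number of TTM18 files.
calls <- 13231
per_call <- c(pipe = 229.6775e-6, no_pipe = 126.1950e-6) # seconds per call
net_penalty <- per_call * calls # total seconds spent across all files
print(round(net_penalty, 2))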
TLDR
In your case, you can speed things up by using the fst package, although you would need to convert the delimited .txt files to fst files.
# before
TTM18_fetch <- function(x, col_range, origin_id) {
res <- data.table::fread(x, select = col_range)
res <- subset(res, from_id == origin_id)
return(res)
}
# after (NB x needs to be a fst file)
col_range <- c('from_id', 'to_id', 'car_r_t', 'car_m_t', 'car_sl_t')
TTM18_fetch <- function(x, col_range, origin_id) {
res <- fst::read_fst(path = x,
columns = col_range,
as.data.table = TRUE)[from_id==origin_id]
return(res)
}
To convert the .txt files to fst:
library(data.table)
library(fst)
ttm_path <- 'REPLACE THIS'
new_ttm_path <- 'REPLACE THIS'
# Get filepaths of all of the TTM18 data. Remove metadata textfile filepath.
all_files <- list.files(path = ttm_path,
pattern = ".txt$",
recursive = TRUE,
full.names = TRUE)
all_files <- all_files[!grepl('[Mm]eta', all_files)]
# creating new file paths and names for fst files
file_names <- list.files(path = ttm_path,
pattern = ".txt$",
recursive = TRUE)
file_names <- file_names[!grepl('[Mm]eta', file_names)]
file_names <- gsub(pattern = '\\.txt$',
                   replacement = '.fst',
                   x = file_names)
file_names <- file.path(new_ttm_path, file_names)
# txt to fst conversion
require(progress) # this will help you keep track of progress
pb <- progress_bar$new(
format = " :what [:bar] :percent eta: :eta",
clear = FALSE, total = length(file_names), width = 60)
# an index file to store from_id file locations
from_id_paths <- data.table(from_id = numeric(),
file_path = character())
for(i in seq_along(file_names)){
pb$tick(tokens = list(what = 'reading'))
tmp <- data.table::fread(all_files[i], key = 'from_id')
pb$tick(0, tokens = list(what = 'writing')) # update the label without advancing
fst::write_fst(tmp,
compress = 50, # less compressed files read faster
path = file_names[i] )
pb$tick(0, tokens = list(what = 'indexing'))
from_id_paths <- rbind(from_id_paths,
data.table(from_id = unique(tmp$from_id),
file_path = file_names[i]))
}
setkey(from_id_paths, from_id)
write_fst(from_id_paths,
path = file.path(new_ttm_path, 'from_id_index.fst'),
compress = 0)
This would be the replacement for the fetching part of the code:
library(fst)
library(data.table)
new_ttm_path <- 'REPLACE THIS'
#### Fetch TTM18 data
origin_id <- "5985086"
origin_id_num <- as.numeric(origin_id)
# column positions of columns from_id, to_id, car_r_t, car_m_t, car_sl_t
col_range <- c('from_id', 'to_id', 'car_r_t', 'car_m_t', 'car_sl_t')
# grid_f as data.table version
dt_grid <- as.data.table(grid_f)
nescessary_files <- read_fst(path = file.path(new_ttm_path,
'from_id_index.fst'),
as.data.table = TRUE
)[from_id==origin_id,file_path]
TTM18_fetch <- function(x, col_range, origin_id) {
res <- fst::read_fst(path = x,
columns = col_range,
as.data.table = TRUE)[from_id==origin_id]
return(res)
}
result <- rbindlist(lapply(nescessary_files, FUN = TTM18_fetch, col_range, origin_id_num),
fill = TRUE)
result <- data.table::merge.data.table(dt_grid, result, by.x = "YKR_ID", by.y = "to_id")
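As a quick sanity check, the merged result can be compared against the shape the question expects:
# The question states the merged result should have 66155 rows of 12 variables
dim(result) # expected: 66155 rows, 12 columns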
I am trying to practice programming functions in R. I have made a function that allows me to determine which is the best pokemon for each attribute (e.g. attack, speed, defense, etc.) per given type of pokemon (e.g. water, psychic, etc.). So far, I am only able to do this for one pokemon generation (reflected in one excel file). I want to do the same for all 6 generations (stored in 6 excel files). I have been working on the code for some time... Maybe somebody here can give some input? Here is my current code in R for the said function:
bestpoke <- function(Type1, attri){
  data <- read.csv("gen01.csv", colClasses = "character", header=TRUE)
  dx <- as.data.frame(cbind(data[, 2],   # Name
                            data[, 3],   # Type1
                            data[, 6],   # HP
                            data[, 7],   # Attack
                            data[, 8],   # Defense
                            data[, 9],   # SpecialAtk
                            data[, 10],  # SpecialDef
                            data[, 11]), # Speed
                      stringsAsFactors = FALSE)
  colnames(dx) <- c("Name", "Type1", "HP", "Attack", "Defense", "SpecialAtk", "SpecialDef", "Speed")
  ## Check that name and attributes are valid
  if(!Type1 %in% dx[, "Type1"]){
    stop('invalid Type')
  } else if(!attri %in% c("HP", "Attack", "Defense", "SpecialAtk", "SpecialDef", "Speed")){
    stop('invalid attribute')
  } else {
    da <- which(dx[, "Type1"] == Type1)
    db <- dx[da, ] # extracting data for the called type
    dc <- as.numeric(db[, eval(attri)])
    max_val <- max(dc, na.rm = TRUE)
    result <- db[, "Name"][which(dc == max_val)]
    output <- result[order(result)]
  }
  return(output)
}
First, I believe that you can simplify your code a little, especially the first part, where you define the objects data and dx.
data <- read.csv("gen01.csv", stringsAsFactors = F)
You can use F or T instead of FALSE or TRUE. The read.csv() function has a stringsAsFactors parameter; with it set to FALSE, character columns are read as character and numeric columns remain numeric. The header parameter is TRUE by default. Also, the output of read.csv() is already a data.frame, so you don't have to use the as.data.frame function at that point. For more information run "?read.csv" in your R console. Then, you can use the following to subset the columns that you want:
dx <- data[, c(2, 3, 6:11)]
colnames(dx) <- c("Name", "Type1", "HP", "Attack", "Defense", "SpecialAtk","SpecialDef", "Speed")
#This last line is just fine if you want to change the names of the columns.
#You can use the function names() instead, but I think that they work the same way.
Now, referring to the other generations, I think that you can practice with some if/else conditionals. First install the xlsx and readxl packages so you don't necessarily have to export your Excel files to csv format. You can try adding a new parameter "gen" or "generation" that requires a numeric input from 1 to 6, for example, and do something like this:
bestpoke<-function(Type1, attri, gen = 1){ # First generation will be the default
require(readxl)
if(gen == 1) {
data <- read.csv("gen01.csv", stringsAsFactors = F)
} else if(gen == 2) {
data <- read_xls("gen02.xls")
} else if(gen == 3) { } # And so on...
dx <- data[, c(2, 3, 6:11)]
colnames(dx) <- c("Name", "Type1", "HP", "Attack", "Defense", "SpecialAtk","SpecialDef", "Speed")
## From this point your function is the same
This is assuming that all the files have the same structure (number, order and type of columns).
You may want to try some other things, but if you are learning, it's better that you search for some ways to accomplish them yourself. This could be:
Reading and merging two or more generations (if the question is which is the best pokemon across two or more generations). In this case, the conditionals have to change; a rough sketch of that idea follows below.
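Here is a rough sketch of that merging idea; it assumes the generation files are all csv files named gen01.csv through gen06.csv with the same column layout (only gen01.csv appears in the question, so the other file names are assumptions):
# Rough sketch: read several generations and stack them, assuming csv files
# named gen01.csv ... gen06.csv that share the same column layout.
read_gens <- function(gens = 1:6) {
  files <- sprintf("gen%02d.csv", gens)
  all_gens <- lapply(files, read.csv, stringsAsFactors = FALSE)
  do.call(rbind, all_gens) # one data frame covering every requested generation
}
# e.g. the combined data for generations 1 and 2:
data <- read_gens(c(1, 2))
dx <- data[, c(2, 3, 6:11)]
colnames(dx) <- c("Name", "Type1", "HP", "Attack", "Defense", "SpecialAtk", "SpecialDef", "Speed")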
Good luck!
I currently have the following problem. I work with Web-of-Science scientific publication and citation data, which has the following structure: a variable "SR" is a string with the name of a publication, and "CR" is a variable with a string containing all references cited in the article, separated by ";".
My task now is to create an edgelist between all publications with the corresponding citations, where every publication and citation combination is in a single row. I do it currently with the following code:
# Some minimal data for example
pub <- c("pub1", "pub2", "pub3")
cit <- c("cit1;cit2;cit3;cit4","cit1;cit4;cit5","cit5;cit1")
M <- cbind(pub,cit)
colnames(M) <- c("SR","CR")
# Create an edgelist
cit_el <- data.frame() #
for (i in seq(1, nrow(M), 1)) { # i=3
cit <- data.frame(strsplit(as.character(M[i,"CR"]), ";", fixed=T), stringsAsFactors=F)
colnames(cit)[1] <- c("SR")
cit$SR_source <- M[i,"SR"]
cit <- unique(cit)
cit_el <- rbind(cit_el, cit)
}
However, for large datasets of some 10k+ of publications (which tend to have 50+ citations), the script runs 15min+. I know that loops are usually an inefficient way of coding in R, yet didn't find an alternative that produces what I want.
Anyone knows some trick to make this faster?
This is my attempt. I haven't compared the speeds of different approaches yet.
First is the artificial data with 10k pubs, 100k possible citations, max is 80 citations per pub.
library(data.table)
library(stringr)
pubCount = 10000
citCount = 100000
maxCitPerPub = 80
pubList <- paste0("pub", seq(pubCount))
citList <- paste0("cit", seq(citCount))
cit <- sapply(sample(seq(maxCitPerPub), pubCount, replace = TRUE),
function(x) str_c(sample(citList, x), collapse = ";"))
data <- data.table(pub = pubList,
cit = cit)
For processing, I use stringr::str_split_fixed to split the citations into columns and use data.table::melt to collapse the columns.
temp <- data.table(pub = pubList, str_split_fixed(data$cit, ";", maxCitPerPub))
result <- melt(temp, id.vars = "pub")[, variable:= NULL][value!='']
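For anyone who wants to compare speeds, here is a rough sketch of timing the original loop against this melt approach on a small 500-pub subset of the simulated data above (the helper names loop_el and melt_el are made up for this sketch):
# Rough timing sketch, assuming the simulated `data` object and the packages
# loaded above; benchmarked on 500 pubs so the loop finishes quickly.
library(microbenchmark)
M <- as.matrix(data[1:500, .(SR = pub, CR = cit)])
loop_el <- function(M) {
  cit_el <- data.frame()
  for (i in seq_len(nrow(M))) {
    cit <- data.frame(SR = strsplit(as.character(M[i, "CR"]), ";", fixed = TRUE)[[1]],
                      stringsAsFactors = FALSE)
    cit$SR_source <- M[i, "SR"]
    cit_el <- rbind(cit_el, unique(cit))
  }
  cit_el
}
melt_el <- function(dt) {
  temp <- data.table(pub = dt$pub, str_split_fixed(dt$cit, ";", maxCitPerPub))
  melt(temp, id.vars = "pub")[, variable := NULL][value != ""]
}
microbenchmark(loop = loop_el(M), melt = melt_el(data[1:500]), times = 5)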
Not sure if this is any quicker, but if I'm understanding correctly, this should give the desired result:
library(data.table) # for rbindlist
rbindlist(lapply(1:nrow(M), function(i){
  data.frame(SR_source = M[i, 'SR'], SR = strsplit(M[i, 'CR'], ';')[[1]])
}))
I'm looking for some assistance in writing some R code to iterate through rows in a dataframe and pass the values in each row to a function and print the output either to an excel file, txt file or just in the console.
The purpose of this is to automate a bunch of distance/time queries (several hundred) to google maps using the function found at this website: http://www.nfactorialanalytics.com/r-vignette-for-the-week-finding-time-distance-between-two-places/
The function on that website is as follows:
library(XML)
library(RCurl)
distance2Points <- function(origin,destination){
results <- list();
xml.url <- paste0('http://maps.googleapis.com/maps/api/distancematrix/xml?origins=',origin,'&destinations=',destination,'&mode=driving&sensor=false')
xmlfile <- xmlParse(getURL(xml.url))
dist <- xmlValue(xmlChildren(xpathApply(xmlfile,"//distance")[[1]])$value)
time <- xmlValue(xmlChildren(xpathApply(xmlfile,"//duration")[[1]])$value)
distance <- as.numeric(sub(" km","",dist))
time <- as.numeric(time)/60
distance <- distance/1000
results[['time']] <- time
results[['dist']] <- distance
return(results)
}
The dataframe will contain two columns: origin postal code and destination postal code (Canada, eh?). I'm a beginner R programmer, so I know how to use read.table to load a txt file into a dataframe. I'm just not sure how to iterate through the dataframe, each time passing values to the distance2Points function and executing. I think this can be done using either a for loop or one of the apply calls?
Thanks for the help!
edit:
To keep it simple lets assume I want to transform these two vectors into a dataframe
> a <- c("L5B4P2","L5B4P2")
> b <- c("M5E1E5", "A2N1T3")
> postcodetest <- data.frame(a,b)
> postcodetest
a b
1 L5B4P2 M5E1E5
2 L5B4P2 A2N1T3
How should I go about iterating over these two rows to return both distances and times from the distance2Points function?
Here's one way to do it, using lapply to produce a list with the results for each row in your data and using Reduce(rbind, [yourlist]) to concatenate that list into a data frame whose rows correspond to the ones in your original. To make this work, we also have to tweak the code in the original function to return a one-row data frame, so I've done that here.
distance2Points <- function(origin,destination){
require(XML)
require(RCurl)
xml.url <- paste0('http://maps.googleapis.com/maps/api/distancematrix/xml?origins=',origin,'&destinations=',destination,'&mode=driving&sensor=false')
xmlfile <- xmlParse(getURL(xml.url))
dist <- xmlValue(xmlChildren(xpathApply(xmlfile,"//distance")[[1]])$value)
time <- xmlValue(xmlChildren(xpathApply(xmlfile,"//duration")[[1]])$value)
distance <- as.numeric(sub(" km","",dist))
time <- as.numeric(time)/60
distance <- distance/1000
# this gives you a one-row data frame instead of a list, b/c it's easy to rbind
results <- data.frame(time = time, distance = distance)
return(results)
}
# now apply that function rowwise to your data, using lapply, and roll the results
# into a single data frame using Reduce(rbind)
results <- Reduce(rbind, lapply(seq(nrow(postcodetest)), function(i)
distance2Points(postcodetest$a[i], postcodetest$b[i])))
Result when applied to your sample data:
> results
time distance
1 27.06667 27.062
2 1797.80000 2369.311
If you would prefer to do this without creating a new object, you could also write separate functions for computing time and distance -- or a single function with those outputs as options -- and then use sapply or just mutate to create new columns in your original data frame. Here's how that might look using sapply:
distance2Points <- function(origin, destination, output){
require(XML)
require(RCurl)
xml.url <- paste0('http://maps.googleapis.com/maps/api/distancematrix/xml?origins=',
origin, '&destinations=', destination, '&mode=driving&sensor=false')
xmlfile <- xmlParse(getURL(xml.url))
if(output == "distance") {
y <- xmlValue(xmlChildren(xpathApply(xmlfile,"//distance")[[1]])$value)
y <- as.numeric(sub(" km", "", y))/1000
} else if(output == "time") {
y <- xmlValue(xmlChildren(xpathApply(xmlfile,"//duration")[[1]])$value)
y <- as.numeric(y)/60
} else {
y <- NA
}
return(y)
}
postcodetest$distance <- sapply(seq(nrow(postcodetest)), function(i)
distance2Points(postcodetest$a[i], postcodetest$b[i], "distance"))
postcodetest$time <- sapply(seq(nrow(postcodetest)), function(i)
distance2Points(postcodetest$a[i], postcodetest$b[i], "time"))
And here's how you could do it in a dplyr pipe with mutate:
library(dplyr)
postcodetest <- postcodetest %>%
mutate(distance = sapply(seq(nrow(postcodetest)), function(i)
distance2Points(a[i], b[i], "distance")),
time = sapply(seq(nrow(postcodetest)), function(i)
distance2Points(a[i], b[i], "time")))
I want to create variable names on the fly inside a list and assign them values in R, but I am unable to get the desired result. Here is the logic of my code:
Upon the function call dat_in <- readf(1,2), an input file is read based on a product and site. After reading, a particular column (the 13th, here) is assigned to a variable aot500. I want this variable returned from the function for each combination of product and site. For example, I need variables named in the list statement as aot500.AF, aot500.CM, aot500.RB to be returned from this function. I am having trouble with the return statement. There is no error, but there is nothing in dat_in; I expect it to have dat_in$aot500.AF etc. Please tell me what is wrong with the return statement. Furthermore, I want to read files for all combinations in a single call to the function, say using a for loop, and I wonder how the return statement would handle a list of more variables.
prod <- c('inv','tot')
site <- c('AF','CM','RB')
readf <- function(pp, kk) {
fname.dsa <- paste("../data/site_data_",prod[pp],"/daily_",site[kk],".dat",sep="")
inp.aod <- read.csv(fname.dsa,skip=4,sep=",",stringsAsFactors=F,na.strings="N/A")
aot500 <- inp.aod[,13]
return(list(assign(paste("aot500",siteabbr[kk],sep="."),aot500)))
}
Almost always there is no need to use assign(); we can solve the problem in two steps: read the files into a list, then give it names.
(Not tested as we don't have your files)
prod <- c('inv', 'tot')
site <- c('AF', 'CM', 'RB')
# get combo of site and prod
prod_site <- expand.grid(prod, site)
colnames(prod_site) <- c("prod", "site")
# Step 1: read the files into a list
res <- lapply(1:nrow(prod_site), function(i){
fname.dsa <- paste0("../data/site_data_",
prod_site[i, "prod"],
"/daily_",
prod_site[i, "site"],
".dat")
inp.aod <- read.csv(fname.dsa,
skip = 4,
stringsAsFactors = FALSE,
na.strings = "N/A")
inp.aod[, 13]
})
# Step 2: assign names to a list
names(res) <- paste("aot500", prod_site$prod, prod_site$site, sep = ".")
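With the names in place, each series can simply be pulled out of the list by name, instead of relying on assign():
# Access the series by the constructed names
res[["aot500.inv.AF"]]
head(res[["aot500.tot.CM"]])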
I propose two answers, one based on dplyr and one based on base R.
You'll probably have to adapt the filename in the readAOT_500 function to your particular case.
Base R answer
#' Function that reads AOT_500 from the given product and site file
#' @param prodsite character vector containing 2 elements:
#' name of a product and name of a site
readAOT_500 <- function(prodsite,
selectedcolumn = c("AOT_500"),
path = tempdir()){
cat(path, prodsite)
filename <- paste0(path, prodsite[1],
prodsite[2], ".csv")
dtf <- read.csv(filename, stringsAsFactors = FALSE)
dtf <- dtf[selectedcolumn]
dtf$prod <- prodsite[1]
dtf$site <- prodsite[2]
return(dtf)
}
# Load one file for example
readAOT_500(c("inv", "AF"))
listofsites <- list(c("inv","AF"),
c("tot","AF"),
c("inv", "CM"),
c( "tot", "CM"),
c("inv", "RB"),
c("tot", "RB"))
# Load all files in a list of data frames
prodsitedata <- lapply(listofsites, readAOT_500)
# Combine all data frames together
prodsitedata <- Reduce(rbind,prodsitedata)
dplyr answer
I use Hadley Wickham's packages to clean data.
library(dplyr)
library(tidyr)
daily_CM <- read.csv("~/downloads/daily_CM.dat",skip=4,sep=",",stringsAsFactors=F,na.strings="N/A")
# Generate all combinations of product and site.
prodsite <- expand.grid(prod = c('inv','tot'),
site = c('AF','CM','RB')) %>%
# Group variables to use do() later on
group_by(prod, site)
Create 6 fake files by sampling from the data you provided. You can skip this section when you have real data. I used various sample lengths so that the number of observations differs for each site.
prodsite$samplelength <- sample(1:495,nrow(prodsite))
prodsite %>%
do(stuff = write.csv(sample_n(daily_CM,.$samplelength),
paste0(tempdir(),.$prod,.$site,".csv")))
Read many files using dplyr::do()
prodsitedata <- prodsite %>%
do(read.csv(paste0(tempdir(),.$prod,.$site,".csv"),
stringsAsFactors = FALSE))
# Select only the columns you are interested in
prodsitedata2 <- prodsitedata %>%
select(prod, site, AOT_500)