I am building a Shiny application in which a large ggplot2 fortified dataframe needs to be calculated over and over again, using a large number of external source files. I am searching for the fastest and most efficient way to do this. In the following paragraphs I go into more detail about the subject and the code I have so far, and I provide the input data so that you can help.
I am using the Helsinki Region Travel Time Matrix 2018, a dataset provided by Digital Geography Lab, a research group in the University of Helsinki. This data uses a generalised map of Helsinki capital region, in 250 x 250 meter cells (in my code grid_f), to calculate travel times between all cells in the map (grid ids are called YKR_ID, n=13231) by public transport, private car, bicycle and by foot. The calculations are stored in delimited .txt files, one text file for all the travel times to a specific cell id. The data is available for download at this website, under "Download the data". NB, the unzipped data is 13.8 GB in size.
Here is a selection from a text file in the dataset:
from_id;to_id;walk_t;walk_d;bike_s_t;bike_f_t;bike_d;pt_r_tt;pt_r_t;pt_r_d;pt_m_tt;pt_m_t;pt_m_d;car_r_t;car_r_d;car_m_t;car_m_d;car_sl_t
5785640;5785640;0;0;-1;-1;-1;0;0;0;0;0;0;-1;0;-1;0;-1
5785641;5785640;48;3353;51;32;11590;48;48;3353;48;48;3353;22;985;21;985;16
5785642;5785640;50;3471;51;32;11590;50;50;3471;50;50;3471;22;12167;21;12167;16
5785643;5785640;54;3764;41;26;9333;54;54;3764;54;54;3764;22;10372;21;10370;16
5787544;5785640;38;2658;10;7;1758;38;38;2658;38;38;2658;7;2183;7;2183;6
My interest is to visualise (with ggplot2) this 250x250m Helsinki region map for one travel mode, the private car, using any of the possible 13231 cell ids, repeatedly if the user wants. Because of this it is important that the dataframe fetch is as fast and efficient as possible. For this question, let's concentrate on the fetching and processing of the data from the external files and use only one specific id value.
In a nutshell, after I have produced a ggplot2::fortify() version of the 250 x 250 meter grid spatial dataset grid_f, I need to:
Scan through all the 13231 Travel Time Matrix 2018 text files
Pick only the relevant columns (from_id, to_id, car_r_t, car_m_t, car_sl_t) in each file
Pick the relevant row using from_id (in this case, origin_id <- "5985086") in each file
Join the resulting rows to the fortified spatial data grid_f
My code is as follows:
# Libraries
library(ggplot2)
library(dplyr)
library(rgdal)
library(data.table)
library(sf)
library(sp)
# File paths. ttm_path is the folder which contains the unchanged Travel
# Time Matrix 2018 data from the research group's home page
ttm_path <- "HelsinkiTravelTimeMatrix2018"
gridpath <- "MetropAccess_YKR_grid_EurefFIN.shp"
#### Import grid cells
# use this CRS information throughout the app
app_crs <- sp::CRS("+init=epsg:3067")
# Read grid shapefile and transform
grid_f <- rgdal::readOGR(gridpath, stringsAsFactors = TRUE) %>%
sp::spTransform(., app_crs) %>%
# preserve grid dataframe data in the fortify
{dplyr::left_join(ggplot2::fortify(.),
as.data.frame(.) %>%
dplyr::mutate(id = as.character(dplyr::row_number() - 1)))} %>%
dplyr::select(-c(x, y))
The code above this point is meant to run only once. The code below, more or less, would be run over and over with different origin_ids.
#### Fetch TTM18 data
origin_id <- "5985086"
origin_id_num <- as.numeric(origin_id)
# column positions of columns from_id, to_id, car_r_t, car_m_t, car_sl_t
col_range <- c(1, 2, 14, 16, 18)
# grid_f as data.table version
dt_grid <- as.data.table(grid_f)
# Get filepaths of all of the TTM18 data. Remove metadata textfile filepath.
all_files <- list.files(path = ttm_path,
pattern = ".txt$",
recursive = TRUE,
full.names = TRUE)
all_files <- all_files[-length(all_files)]
# lapply function
TTM18_fetch <- function(x, col_range, origin_id) {
res <- fread(x, select = col_range)
res <- subset(res, from_id == origin_id)
return(res)
}
# The part of the code that needs to be fast and efficient
result <-
lapply(all_files, FUN = TTM18_fetch, col_range, origin_id_num) %>%
data.table::rbindlist(., fill = TRUE) %>%
data.table::merge.data.table(dt_grid, ., by.x = "YKR_ID", by.y = "to_id")
The dataframe result should have 66155 rows of 12 variables, five rows for each 250x250 meter grid cell. The columns are YKR_ID, long, lat, order, hole, piece, id, group, from_id, car_r_t, car_m_t, car_sl_t.
My current lapply() and data.table::fread() solution takes about 2-3 minutes to complete. I think this is already a good achievement, but I can't help but think there are better and faster ways to do this. So far, I have tried these alternatives to what I now have:
A conventional for loop: that was obviously a slow solution
I tried to teach myself more about vectorised functions in R, but that did not lead anywhere. Used this link
Tried to dabble with with() unsuccessfully using this SO question, inspired by this SO question
Looked into package parallel but ended up not utilising that because of the Windows environment I am using
Tried to find alternative ways to solve this with apply() and sapply() but nothing noteworthy came out of that.
As to why I didn't do all this to the data before ggplot2::fortify, I simply found it troublesome to work with a SpatialPolygonsDataFrame.
Thank you for your time.
Whenever I'm trying to figure out how to improve the performance of my R
functions, I generally use the following approach. First, I look for any
function calls that may be unnecessary, or identify places where multiple
function calls can be simplified into one. Then, I look for the places in my
code that incur the greatest time penalty by benchmarking each
part separately. This can easily be done using the microbenchmark
package.
For example, we can ask if we get better performance with or without
piping (e.g. %>%).
# hint: the pipe adds a little overhead on every call
library(magrittr)
library(microbenchmark)
microbenchmark(
pipe = iris %>% subset(Species=='setosa'),
no_pipe = subset(iris, Species=='setosa'),
times = 200)
Unit: microseconds
expr min lq mean median uq max neval cld
pipe 157.518 196.739 308.1328 229.6775 312.6565 2473.582 200 b
no_pipe 84.894 116.386 145.4039 126.1950 139.4100 612.492 200 a
Here, we find that subsetting a data.frame without piping
takes roughly half the time to execute!
Next, I determine the net time penalty for each place I
benchmarked by multiplying the execution time by the total number of times it
needs to be executed. For the areas with the greatest net time penalty,
I try to replace the code with faster functions and/or try to reduce the total
number of times it needs to be executed.
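As a rough sketch of that bookkeeping using only base R: net_penalty() below is a hypothetical helper (not part of any package) that times a step a few times, takes the median, and multiplies it by the number of times the step would run. The toy step function and the count of 13231 calls are assumptions for illustration only.

```r
# Hypothetical helper: estimate the total cost of a step that runs many times.
net_penalty <- function(f, n_calls, reps = 5) {
  # elapsed seconds for each of `reps` timed runs of f()
  timings <- vapply(seq_len(reps),
                    function(i) system.time(f())[["elapsed"]],
                    numeric(1))
  median(timings) * n_calls
}

# Toy stand-in for "read one file and filter it"
step <- function() subset(iris, Species == "setosa")

# Estimated seconds if the step runs once per file (13231 files assumed)
est <- net_penalty(step, n_calls = 13231)
est
```

Comparing this estimate across the different parts of a pipeline shows which part is worth optimising first.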
TLDR
In your case, you can speed things up by using the fst package,
although you would need to convert your text files to fst files first.
# before
TTM18_fetch <- function(x, col_range, origin_id) {
res <- data.table::fread(x, select = col_range)
res <- subset(res, from_id == origin_id)
return(res)
}
# after (NB x needs to be a fst file)
col_range <- c('from_id', 'to_id', 'car_r_t', 'car_m_t', 'car_sl_t')
TTM18_fetch <- function(x, col_range, origin_id) {
res <- fst::read_fst(path = x,
columns = col_range,
as.data.table = TRUE)[from_id==origin_id]
return(res)
}
To convert your csv files to fst
library(data.table)
library(fst)
ttm_path <- 'REPLACE THIS'
new_ttm_path <- 'REPLACE THIS'
# Get filepaths of all of the TTM18 data. Remove metadata textfile filepath.
all_files <- list.files(path = ttm_path,
pattern = ".txt$",
recursive = TRUE,
full.names = TRUE)
all_files <- all_files[!grepl('[Mm]eta', all_files)]
# creating new file paths and names for fst files
file_names <- list.files(path = ttm_path,
pattern = ".txt$",
recursive = TRUE)
file_names <- file_names[!grepl('[Mm]eta', file_names)]
file_names <- gsub(pattern = '\\.txt$',
replacement = '.fst',
x = file_names)
file_names <- file.path(new_ttm_path, file_names)
# make sure the output subfolders exist before writing
for (d in unique(dirname(file_names))) dir.create(d, recursive = TRUE, showWarnings = FALSE)
# csv to fst conversion
library(progress) # this will help you keep track of things
pb <- progress_bar$new(
format = " :what [:bar] :percent eta: :eta",
clear = FALSE, total = length(file_names), width = 60)
# an index file to store from_id file locations
from_id_paths <- data.table(from_id = numeric(),
file_path = character())
for(i in seq_along(file_names)){
# tick(0) redraws the bar with a new label without advancing it
pb$tick(0, tokens = list(what = 'reading'))
tmp <- data.table::fread(all_files[i], key = 'from_id')
pb$tick(0, tokens = list(what = 'writing'))
fst::write_fst(tmp,
compress = 50, # less compressed files read faster
path = file_names[i] )
pb$tick(0, tokens = list(what = 'indexing'))
from_id_paths <- rbind(from_id_paths,
data.table(from_id = unique(tmp$from_id),
file_path = file_names[i]))
# advance the bar once per finished file
pb$tick(tokens = list(what = 'done'))
}
setkey(from_id_paths, from_id)
write_fst(from_id_paths,
path = file.path(new_ttm_path, 'from_id_index.fst'),
compress = 0)
The fetching code would then be replaced with:
library(fst)
library(data.table)
new_ttm_path <- 'REPLACE THIS'
#### Fetch TTM18 data
origin_id <- "5985086"
origin_id_num <- as.numeric(origin_id)
# names of the columns to read: from_id, to_id, car_r_t, car_m_t, car_sl_t
col_range <- c('from_id', 'to_id', 'car_r_t', 'car_m_t', 'car_sl_t')
# grid_f as data.table version
dt_grid <- as.data.table(grid_f)
# look up which fst files contain the requested origin (numeric id)
necessary_files <- read_fst(path = file.path(new_ttm_path,
'from_id_index.fst'),
as.data.table = TRUE
)[from_id==origin_id_num,file_path]
TTM18_fetch <- function(x, col_range, origin_id) {
res <- fst::read_fst(path = x,
columns = col_range,
as.data.table = TRUE)[from_id==origin_id]
return(res)
}
result <- rbindlist(lapply(necessary_files, FUN = TTM18_fetch, col_range, origin_id_num),
fill = TRUE)
result <- data.table::merge.data.table(dt_grid, result, by.x = "YKR_ID", by.y = "to_id")
I have seven pairs of asc files loaded into R, asc[i] and wasc[i], where [i] runs from 1 to 7. I need to combine each wasc[i] with asc[i][[1]] (just the first column of asc[i] together with the whole wasc[i] data frame).
This should be repeated for every pair of asc and wasc files.
The code keeps giving me blank data frames, and I don't know why it doesn't work. The naming is correct, yet the code does not recognize that asc[i] and wasc[i] correspond to the previously loaded files.
Any help will be greatly appreciated.
# These data frames will reproduce my issue
asc1 <- data.frame(x= c(rep("A.tif", 20)), y = 1:20)
wasc1 <- data.frame(x= c(rep("B.tif", 20)), y = c(rep("Imager",20)))
asc2 <- data.frame(x= c(rep("A.tif", 20)), y = 1:20)
wasc2 <- data.frame(x= c(rep("B.tif", 20)), y = c(rep("Imager",20)))
asc3 <- data.frame(x= c(rep("A.tif", 20)), y = 1:20)
wasc3 <- data.frame(x= c(rep("B.tif", 20)), y = c(rep("Imager",20)))
for (i in 1:3) {
d <- paste("asc", i, sep ="")
f <- paste("wasc", i, sep ="")
full_wing <- as.character(paste("full_wing", i, sep = ""))
assign(full_wing,cbind(d[[1]], f))
}
# Output of full_wing1 data frame
dput(full_wing1)
structure(c("asc1", "wasc1"), .Dim = 1:2, .Dimnames = list(NULL,
c("", "f")))
Additional Information:
asc files are 19 columns long
wasc files are 13 columns long
I only want to combine column 1 from the asc file with the entire wasc file, thus cutting out the remaining 18 columns of the asc file.
# put data in a list
asc = mget(ls(pattern = "^asc"))
wasc = mget(ls(pattern = "^wasc"))
full_wing = Map(f = function(w, a) cbind(w, a[[1]]), w = wasc, a = asc)
Map is a nice shortcut for iterating in parallel over multiple arguments. It returns a list, and you can access the individual elements with, e.g., full_wing[[1]], full_wing[[3]], etc. Map is just a shortcut; the above code is basically equivalent to the for loop below:
results = list()
for (i in seq_along(asc)) {
results[[i]] = cbind(wasc[[i]], asc[[i]][[1]])
}
I use mget to put the data in a list because in your example you already have objects like asc1, asc2, etc. A much better way to go is to never create those variables in the first place and instead read the files directly into a list, something like this:
asc_paths = list.files(pattern = "^asc")
asc = lapply(asc_paths, read.table)
You can see a lot more explanation of this at How to make a list of data frames?
If you only ever need one column of the asc files, another way to simplify this would be to only read in the needed column, see Only read limited number of columns for some recommendations there.
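A minimal sketch of that idea with base R, assuming a whitespace-delimited file with a header: in read.table, a colClasses entry of "NULL" skips a column and NA lets R guess its type. The temporary file and its three columns are made up for illustration; for a real 19-column asc file the vector would be c(NA, rep("NULL", 18)).

```r
# Write a small stand-in file with 3 columns (the real asc files have 19)
tmp <- tempfile(fileext = ".txt")
write.table(data.frame(x = rep("A.tif", 5), y = 1:5, z = 5:1),
            tmp, row.names = FALSE, quote = FALSE)

# NA lets read.table guess the type; "NULL" skips the column entirely
first_col <- read.table(tmp, header = TRUE,
                        colClasses = c(NA, "NULL", "NULL"))
str(first_col)
```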
I want to create variable names on the fly inside a list and assign them values in R, but I am unable to get the desired result. Here is the logic of my code:
Upon the function call dat_in <- readf(1,2), an input file is read based on a product and a site. After reading, a particular column (the 13th here) is assigned to a variable aot500. I want the function to return this variable for each combination of product and site. For example, I need the list to contain elements named aot500.AF, aot500.CM and aot500.RB. I am having trouble with the return statement. There is no error, but there is nothing in dat_in; I expect it to have dat_in$aot500.AF etc. What is wrong in the return statement? Furthermore, I want to read the files for all combinations in a single call to the function, say using a for loop, and I wonder how the return statement would handle a list of more variables.
prod <- c('inv','tot')
site <- c('AF','CM','RB')
readf <- function(pp, kk) {
fname.dsa <- paste("../data/site_data_",prod[pp],"/daily_",site[kk],".dat",sep="")
inp.aod <- read.csv(fname.dsa,skip=4,sep=",",stringsAsFactors=F,na.strings="N/A")
aot500 <- inp.aod[,13]
return(list(assign(paste("aot500",siteabbr[kk],sep="."),aot500)))
}
There is almost never a need to use assign(). We can solve the problem in two steps: read the files into a list, then give the list elements names.
(Not tested as we don't have your files)
prod <- c('inv', 'tot')
site <- c('AF', 'CM', 'RB')
# get combo of site and prod
prod_site <- expand.grid(prod, site)
colnames(prod_site) <- c("prod", "site")
# Step 1: read the files into a list
res <- lapply(seq_len(nrow(prod_site)), function(i){
fname.dsa <- paste0("../data/site_data_",
prod_site[i, "prod"],
"/daily_",
prod_site[i, "site"],
".dat")
inp.aod <- read.csv(fname.dsa,
skip = 4,
stringsAsFactors = FALSE,
na.strings = "N/A")
inp.aod[, 13]
})
# Step 2: assign names to a list
names(res) <- paste("aot500", prod_site$prod, prod_site$site, sep = ".")
I propose two answers, one based on dplyr and one based on base R.
You'll probably have to adapt the filename in the readAOT_500 function to your particular case.
Base R answer
#' Function that reads AOT_500 from the given product and site file
#' @param prodsite character vector containing 2 elements
#' name of a product and name of a site
readAOT_500 <- function(prodsite,
selectedcolumn = c("AOT_500"),
path = tempdir()){
cat(path, prodsite)
filename <- paste0(path, prodsite[1],
prodsite[2], ".csv")
dtf <- read.csv(filename, stringsAsFactors = FALSE)
dtf <- dtf[selectedcolumn]
dtf$prod <- prodsite[1]
dtf$site <- prodsite[2]
return(dtf)
}
# Load one file for example
readAOT_500(c("inv", "AF"))
listofsites <- list(c("inv","AF"),
c("tot","AF"),
c("inv", "CM"),
c( "tot", "CM"),
c("inv", "RB"),
c("tot", "RB"))
# Load all files in a list of data frames
prodsitedata <- lapply(listofsites, readAOT_500)
# Combine all data frames together
prodsitedata <- Reduce(rbind,prodsitedata)
dplyr answer
I use Hadley Wickham's packages to clean data.
library(dplyr)
library(tidyr)
daily_CM <- read.csv("~/downloads/daily_CM.dat",skip=4,sep=",",stringsAsFactors=F,na.strings="N/A")
# Generate all combinations of product and site.
prodsite <- expand.grid(prod = c('inv','tot'),
site = c('AF','CM','RB')) %>%
# Group variables to use do() later on
group_by(prod, site)
Create 6 fake files by sampling from the data you provided
You can skip this section when you have real data.
I used various sample length so that the number of observations
differs for each site.
prodsite$samplelength <- sample(1:495,nrow(prodsite))
prodsite %>%
do(stuff = write.csv(sample_n(daily_CM,.$samplelength),
paste0(tempdir(),.$prod,.$site,".csv")))
Read many files using dplyr::do()
prodsitedata <- prodsite %>%
do(read.csv(paste0(tempdir(),.$prod,.$site,".csv"),
stringsAsFactors = FALSE))
# Select only the columns you are interested in
prodsitedata2 <- prodsitedata %>%
select(prod, site, AOT_500)
In R, I need to return two objects from a function:
myfunction <- function(input.file)
{
a.data.frame <- read.csv(file = input.file, header = TRUE, sep = ",", dec = ".")
index.hash <- get_indices_function(colnames(a.data.frame))
alist <- list("a.data.frame" = a.data.frame, "index.hash" = index.hash)
return(alist)
}
But the objects returned from myfunction all become a list, not a data.frame and a hash.
Any help would be appreciated.
You can only return one object from an R function; this is consistent with pretty much every other language I've used. However, you'll note that the objects retain their original structure within the list, so alist[[1]] and alist[[2]] should be the data frame and hash respectively, and are still structured as a data frame and a hash. Once you've returned them from the function, you can split them out into separate objects if you want :).
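A small self-contained sketch of that pattern: read_with_index() is a hypothetical stand-in for the function in the question, and a named integer vector plays the role of the hash, since the original get_indices_function is not shown.

```r
# Hypothetical function returning two objects bundled in a named list
read_with_index <- function(df) {
  # stand-in for the hash: column name -> column position
  index <- setNames(seq_along(df), colnames(df))
  list(data = df, index = index)
}

out <- read_with_index(head(iris))

# Split the list back into separate objects; each keeps its own class
df  <- out$data
idx <- out$index

class(df)
idx[["Species"]]
```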
You can use a structure.
return (structure(class = "myclass",
list(data = data.frame,
type = anytype,
page.content = page.content.as.string.vector,
knitr = knitr)))
Then you can access your data with
values <- myfunction(...)
values$data
values$type
values$page.content
values$knitr
and so on.
A working example from my package:
sju.table.values <- function(tab, digits=2) {
if (!inherits(tab, "ftable")) tab <- ftable(tab)
tab.cell <- round(100*prop.table(tab),digits)
tab.row <- round(100*prop.table(tab,1),digits)
tab.col <- round(100*prop.table(tab,2),digits)
tab.expected <- as.table(round(as.array(margin.table(tab,1)) %*% t(as.array(margin.table(tab,2))) / margin.table(tab)))
# -------------------------------------
# return results
# -------------------------------------
invisible (structure(class = "sjutablevalues",
list(cell = tab.cell,
row = tab.row,
col = tab.col,
expected = tab.expected)))
}
tab <- table(sample(1:2, 30, TRUE), sample(1:3, 30, TRUE))
# show expected values
sju.table.values(tab)$expected
# show cell percentages
sju.table.values(tab)$cell
Is there a way to make matching values at scale more programmatic? Basically what I want to do is add a bunch of columns for value lookups onto a dataframe, but I don't want to write the match[] argument every time. It seems like this would be a use case for mapply but I can't quite figure out how to use it here. Any suggestions?
Here's the data:
data <- data.frame(
region = sample(c("northeast","midwest","west"), 50, replace = T),
climate = sample(c("dry","cold","arid"), 50, replace = T),
industry = sample(c("tech","energy","manuf"), 50, replace = T))
And the corresponding lookup tables:
lookups <- data.frame(
orig_val = c("northeast","midwest","west","dry","cold","arid","tech","energy","manuf"),
look_val = c("dir1","dir2","dir3","temp1","temp2","temp3","job1","job2","job3")
)
So now what I want to do is: First add a column to "data" that's called "reg_lookups" and it will match the region to its appropriate value in "lookups". Do the same for "climate_lookups" and so on.
Right now, I've got this mess:
data$reg_lookup <- lookups$look_val[match(data$region, lookups$orig_val)]
data$clim_lookup <- lookups$look_val[match(data$climate, lookups$orig_val)]
data$indus_lookup <- lookups$look_val[match(data$industry, lookups$orig_val)]
I've tried using a function to do this, but the function doesn't seem to work, so then applying that to mapply is a no-go (plus I'm confused about how the mapply syntax would work here):
match_fun <- function(df, newval, df_look, lookup_val, var, ref_val) {
df$newval <- df_look$lookup_val[match(df$var, df_look$ref_val)]
return(df)
}
data2 <- match_fun(data, reg_2, lookups, look_val, region, orig_val)
I think you're just trying to do this:
data <- merge(data,lookups[1:3,],by.x = "region",by.y = "orig_val",all.x = TRUE)
data <- merge(data,lookups[4:6,],by.x = "climate",by.y = "orig_val",all.x = TRUE)
data <- merge(data,lookups[7:9,],by.x = "industry",by.y = "orig_val",all.x = TRUE)
But it would be much better to store the lookups in separate data frames. That way you can control the names of the new columns more easily. It would also allow you to do something like this:
lookups1 <- split(lookups,rep(1:3,each = 3))
colnames(lookups1[[1]]) <- c('region','reg_lookup')
colnames(lookups1[[2]]) <- c('climate','clim_lookup')
colnames(lookups1[[3]]) <- c('industry','indus_lookup')
do.call(cbind,mapply(merge,
x = list(data[,1,drop = FALSE],data[,2,drop =FALSE],data[,3,drop = FALSE]),
y = lookups1,
MoreArgs = list(all.x = TRUE),
SIMPLIFY = FALSE))
and you should be able to wrap that do.call bit in a function.
I used data[,1,drop = FALSE] in order to preserve them as one column data frames.
The way you structure mapply calls is to pass the arguments to iterate over as named lists (the x = and y = parts). I wanted to be sure to preserve all the rows from data, so I passed all.x = TRUE via MoreArgs (note the capital M), so that it gets passed each time merge is called. Finally, I need to stitch them all together myself, so I turned off SIMPLIFY.
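If the goal is simply to add one lookup column per variable while keeping the rows of data in their original order, a match-based helper is another option. This is a minimal sketch, assuming the single lookups table from the question; add_lookups() and the "_lookup" suffix are hypothetical names, so the new columns are called region_lookup, climate_lookup and industry_lookup rather than reg_lookup etc.

```r
set.seed(1)
data <- data.frame(
  region   = sample(c("northeast", "midwest", "west"), 50, replace = TRUE),
  climate  = sample(c("dry", "cold", "arid"), 50, replace = TRUE),
  industry = sample(c("tech", "energy", "manuf"), 50, replace = TRUE))
lookups <- data.frame(
  orig_val = c("northeast", "midwest", "west", "dry", "cold", "arid",
               "tech", "energy", "manuf"),
  look_val = c("dir1", "dir2", "dir3", "temp1", "temp2", "temp3",
               "job1", "job2", "job3"))

# Hypothetical helper: look up every listed column against one lookup table
add_lookups <- function(df, cols, lookup, suffix = "_lookup") {
  # match() returns positions in lookup$orig_val, so row order is untouched
  df[paste0(cols, suffix)] <- lapply(df[cols], function(v) {
    lookup$look_val[match(v, lookup$orig_val)]
  })
  df
}

data2 <- add_lookups(data, c("region", "climate", "industry"), lookups)
head(data2)
```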