Using for loop and rbind to iterate over multiple files - r

I have a small R script of 14 functions and want to run it for 81 files. While I have read several posts on Stack Overflow that address similar issues, I am still having trouble getting this working. I am using a for loop and rbind.
All functions within the { } of the loop work. I have tested them without the for loop and I get the vector of data that I need. But when I run the for loop I only get an output for the last file in the folder. I am not sure what is going on.
Is the for loop working right (is it iterating through the files) and simply overwriting the previous runs? If the for loop is working then I assume I have a problem with my rbind. Or, is the for loop only running the last file in list.files()?
In the end, I want a matrix (or table) with the results of the 14 functions for each of the 81 files.
Here is the code:
res=(1:14)
for(i in list.files())
{
nd = read.csv(i, header= TRUE, row.names =1, check.names = FALSE)
mx = as.matrix(nd)
res[1]=basename(i)
res[2]=-99 #this is just a place holder
res[3]=gden(mx)
res[4]=centralization(mx,degree)
deg = degree(mx, gmode="graph", diag=FALSE, rescale=FALSE)
res[5]=mean(deg)
res[6]=sd(deg)
res[7]=max(deg)
res[8]=min(deg)
Ndeg = degree(mx, gmode="graph", diag=FALSE, rescale=TRUE)*1000
res[9]=mean(Ndeg)
res[10]=sd(Ndeg)
res[11]=max(Ndeg)
res[12]=min(Ndeg)
iso = isolates(mx, diag=FALSE)
res[13]=length(iso)
res[14]=nrow(mx)
}
results=rbind(res)
results

Make your set of functions together a new function and sapply it to every element of list.files():
out <- sapply(list.files(), function(i){
nd = read.csv(i, header= TRUE, row.names =1, check.names = FALSE)
mx = as.matrix(nd)
res = numeric(14)
res[1]=basename(i)
res[2]=-99 #this is just a place holder
res[3]=gden(mx)
res[4]=centralization(mx,degree)
deg = degree(mx, gmode="graph", diag=FALSE, rescale=FALSE)
res[5]=mean(deg)
res[6]=sd(deg)
res[7]=max(deg)
res[8]=min(deg)
Ndeg = degree(mx, gmode="graph", diag=FALSE, rescale=TRUE)*1000
res[9]=mean(Ndeg)
res[10]=sd(Ndeg)
res[11]=max(Ndeg)
res[12]=min(Ndeg)
iso = isolates(mx, diag=FALSE)
res[13]=length(iso)
res[14]=nrow(mx)
return(res)
}
out

you have to have rbind(res) inside the loop,something like this
results = rbind(res), but that is not enough. something like results = rbind(results,res)
It depends how you want to store them as an array of array etc..
You'd better also lapply sapply etc.. instead of loop

I also posted this question on my university listserv and a fellow student provided the following fixes. And now it works :)
res=(1:14)
summary=(1:14)
for(i in list.files())
{
....code as above.....
summary=rbind(summary, res)
}
summary
# then to put into a .csv
write.csv(summary, "nameoffile.csv")

Related

How to handle exception in R like in Java?

I am writing a server program in R where I upload PDF files and which later extracts data from the tables inside the pdf.
If required table and data is there, it works fine. But if not, it gives me error for files[i][[1]][[3]] and files[i][[1]][[4]].
error: subscript out of bounds
I want to default the value of buy and sell price at NA, if the table is not there.
all_data <- eventReactive(input$done, {
req(input$files)
files = {}
cost_price_list = {}
sell_price_list = {}
df <- data.frame(cost_price_list = character(), sell_price_list = character())
files <- lapply(input$files$datapath, extract_tables)
for (i in 1:length(input$files$datapath))
{
tryCatch(
{
cost_price_list <- files[i][[1]][[3]]
sell_price_list <- files[i][[1]][[4]]
},
error=function(cond) {
cost_price_list[i] = NA
sell_price_list[i] = NA
}
)
df[nrow(df) + 1,] <- c(cost_price_list[i],sell_price_list[i])
}
#return dataframe as table
df
})
But the above code doesn't work for me if the table is not there in pdf.
What am i doing wrong?
Please help.
There are multiple problems with your code.
First of, files = {} works but almost certainly doesn’t do what you intended. Did you mean files = list()? Otherwise, the idiomatic way of expressing what your code does is to write files = NULL in R. A more idiomatic way would be not to assign an empty object at all. In fact, your code overrides this initial value of files anyway, so files = {} is entirely redundant; and so are the next two assignments.
Next, since files is a list, you need to use double brackets to extract single elements from it: files[[i]], not files[i].
Another problem with your code is that assignment in R is always local. So cost_price_list[i] = NA creates a local variable inside the error handler function, even though cost_price_list exists in an outer scope.
If you want to assign to the outer variable, you need to explicitly specify the scope (or use <<-, but I recommend against this practice):
…
outer = environment()
for (i in 1:length(input$files$datapath))
{
tryCatch(
{
cost_price_list <- files[i][[1]][[3]]
sell_price_list <- files[i][[1]][[4]]
},
error=function(cond) {
outer$cost_price_list[i] = NA
outer$sell_price_list[i] = NA
}
)
df[nrow(df) + 1,] <- c(cost_price_list[i],sell_price_list[i])
}
…
But this still won’t work, because these variables do not have a value that can be meaningfully subset into (and it is not entirely clear what your code is attempting to do). Still, I hope the above gives you a foundation to work with.

Paste multiple elements in R

I make this code using a for-statement. (The main purpose of this code is to list different webpages, which are obtained via httr and rvest)
r = "asdgkjkhdf"
t = "osrt"
all = c()
for(i in 1:400)
{
y = paste(r, i, sep = '')
d = paste(y, t, sep = '')
all = c(all, d)
}
all
I got things like these (pasted numbers are actually getting accumulated in the each results)
[1]asdgkjkhdf1osrt
[2]asdgkjkhdf12osrt
[3]asdgkjkhdf123osrt
[4]asdgkjkhdf1234osrt
...
But I want results like these regardless of how many numbers i put in 'for()'function.
[1]asdgkjkhdf1osrt
[2]asdgkjkhdf2osrt
...
[400]asdgkjkhdf400osrt
like these above
What should I change in order to have what I want to result in?
Should I use paste(substr(), substr(), sep='')?
If you really want to use a for-statement you can use the following
r = "asdgkjkhdf"
t = "osrt"
all = c()
for (idx in 1:400)
all = c(all, paste0(r, idx, t))
However, in R you should prefer code without for-statements since, in general, this is less readable and hurts performance. The solution without the for-statement (given by Roland in the comments) equals
all <- paste0(r, 1:400, t)
Note that paste0("string")is just a short notation for paste("string", sep='').

How would you write this using apply family of functions in R? Should you?

Here is my R Script that works just fine:
perc.rank <- function(x) trunc(rank(x)) / length(x) * 100.0
library(dplyr)
setwd("~/R/xyz")
datFm <- read.csv("yellow_point_02.csv")
datFm <- filter(datFm, HRA_ClassHRA_Final != -9999)
quant_cols <- c("CL_GammaRay_Despiked_Spline_MLR", "CT_Density_Despiked_Spline_FinalMerged",
"HRA_PC_1HRA_Final", "HRA_PC_2HRA_Final","HRA_PC_3HRA_Final",
"SRES_IMGCAL_SHIFT2VL_Slab_SHIFT2CL_DT", "Ultrasonic_DT_Despiked_Spline_MLR")
# add an extra column to datFm to store the quantile value
for (column_name in quant_cols) {
datFm[paste(column_name, "quantile", sep = "_")] <- NA
}
# initialize an empty dataframe with the new column names appended
newDatFm <- datFm[0,]
# get the unique values for the hra classes
hraClassNumV <- sort(unique(datFm$HRA_ClassHRA_Final))
# loop through the vector and create currDatFm and append it to newDatFm
for (i in hraClassNumV) {
currDatFm <- filter(datFm, HRA_ClassHRA_Final == i)
for (column_name in quant_cols) {
currDatFm <- within(currDatFm,
{
CL_GammaRay_Despiked_Spline_MLR_quantile <- perc.rank(currDatFm$CL_GammaRay_Despiked_Spline_MLR)
CT_Density_Despiked_Spline_FinalMerged_quantile <- perc.rank(currDatFm$CT_Density_Despiked_Spline_FinalMerged)
HRA_PC_1HRA_Final_quantile <- perc.rank(currDatFm$HRA_PC_1HRA_Final)
HRA_PC_2HRA_Final_quantile <- perc.rank(currDatFm$HRA_PC_2HRA_Final)
HRA_PC_3HRA_Final_quantile <- perc.rank(currDatFm$HRA_PC_3HRA_Final)
SRES_IMGCAL_SHIFT2VL_Slab_SHIFT2CL_DT_quantile <- perc.rank(currDatFm$SRES_IMGCAL_SHIFT2VL_Slab_SHIFT2CL_DT)
Ultrasonic_DT_Despiked_Spline_MLR_quantile <- perc.rank(currDatFm$Ultrasonic_DT_Despiked_Spline_MLR)
}
)
}
newDatFm <- rbind(newDatFm, currDatFm)
}
newDatFm <- newDatFm[order(newDatFm$Core_Depth),]
# head(newDatFm, 10)
write.csv(newDatFm, file = "Ricardo_quantiles.csv")
I have a few questions though. Every R book or video that I have read or watched, recommends using the 'apply' family of language constructs over the classic 'for' loop stating that apply is much faster.
So the first question is: how would you write it using apply (or tapply or some other apply)?
Second, is this really true though that apply is much faster than for? The csv file 'yellow_point_02.csv' has approx. 2500 rows. This script runs almost instantly on my Macbook Pro which has 16 Gig of memory.
Third, See the 'quant_cols' vector? I created it so that I could write a generic loop (for columm_name in quant_cols) ....But I could not make it to work. So I hard-coded the column names post-fixed with '_quantile' and called the 'perc.rank' many times. Is there a way this could be made dynamic? I tried the 'paste' stuff that I have in my script, but that did not work.
On the positive side though, R seems awesome in its ability to cut through the 'Data Wrangling' tasks with very few statements.
Thanks for your time.

memory allocation error while using mclapply

it's the first i use mclapply to run parallel script on multiple process, but the problem that i've tried the script on my laptop and it worked very well and filled the dataframe correctly, but now when i run the script on my office pc, when the printing ends and it's time to collect the data, the script stops with this error :
Error: cannot allocate vector of size 80 Kb
fun <- function(testdf) {
l=12000
errordf=data.frame()
errordf <- mclapply(1:nrow(15000), function(i)
{
for (ind in 1:nrow(testdf))
{
if( i >= l/2 ){
testdf[ind,]$X = testdf[ind,]$pos * 2
} else
{
testdf[ind,]$X = testdf[ind,]$pos/l
}
}
permdf <- testdf
lapply(1:100, function(j)
{ permdf$X<- sample(permdf$X,nrow(permdf), replace=FALSE)
fit=lm(X ~ gx, permdf) #linear regression calculation
regerror=sum(residuals(fit)^2)
data.frame(pc=i,error=regerror )
})
}, mc.cores=3)
res<-NULL
tmp <- lapply(errordf, function(ii){
tmp <- lapply(ii, function(ij){ #rbind the data and return the dataframe
res<<- rbind(res, ij)
})
})
return (res)
}
testdf example:
structure(list(ax = c(-0.0242214, 0.19770304, 0.01587302, -0.0374415,
0.05079826, 0.12209738), gx = c(-0.3913043, -0.0242214, -0.4259067,
-0.725, -0.0374415, 0.01587302), pos = c(11222, 13564, 16532,
12543, 12534, 14354)), .Names = c("ax", "gx", "pos"), row.names = c(NA,
-6L), class = "data.frame")
i'm sure that the code is working (that's why i did not included the full code), because i tried it multiple times on my laptop, but when i tries it on my office pc it lunch this error.
any help would be appreciatd
Right now you don't use the apply as intended in you last double nested lapply loop, you might as well use a for loop instead of using lapply combined with a global variable. In addition, you continuously grow res, this is rather memory and time intensive. Normally, an lapply loop would not suffer from this problem, but your use of a global variable totally negates the advantage. You seem to have a double nested list you want to rbind. I would defintely not loop over the data structure, I would just use something along the lines of do.call("rbind", data_structure) to do this, although it is hard to provide concrete advice without a reproducible example. This solution does not suffer from the continuous growing problem you experience.

combination of expand.grid and mapply?

I am trying to come up with a variant of mapply (call it xapply for now) that combines the functionality (sort of) of expand.grid and mapply. That is, for a function FUN and a list of arguments L1, L2, L3, ... of unknown length, it should produce a list of length n1*n2*n3 (where ni is the length of list i) which is the result of applying FUN to all combinations of the elements of the list.
If expand.grid worked to generate lists of lists rather than data frames, one might be able to use it, but I have in mind that the lists may be lists of things that won't necessarily fit into a data frame nicely.
This function works OK if there are exactly three lists to expand, but I am curious about a more generic solution. (FLATTEN is unused, but I can imagine that FLATTEN=FALSE would generate nested lists rather than a single list ...)
xapply3 <- function(FUN,L1,L2,L3,FLATTEN=TRUE,MoreArgs=NULL) {
retlist <- list()
count <- 1
for (i in seq_along(L1)) {
for (j in seq_along(L2)) {
for (k in seq_along(L3)) {
retlist[[count]] <- do.call(FUN,c(list(L1[[i]],L2[[j]],L3[[k]]),MoreArgs))
count <- count+1
}
}
}
retlist
}
edit: forgot to return the result. One might be able to solve this by making a list of the indices with combn and going from there ...
I think I have a solution to my own question, but perhaps someone can do better (and I haven't implemented FLATTEN=FALSE ...)
xapply <- function(FUN,...,FLATTEN=TRUE,MoreArgs=NULL) {
L <- list(...)
inds <- do.call(expand.grid,lapply(L,seq_along)) ## Marek's suggestion
retlist <- list()
for (i in 1:nrow(inds)) {
arglist <- mapply(function(x,j) x[[j]],L,as.list(inds[i,]),SIMPLIFY=FALSE)
if (FLATTEN) {
retlist[[i]] <- do.call(FUN,c(arglist,MoreArgs))
}
}
retlist
}
edit: I tried #baptiste's suggestion, but it's not easy (or wasn't for me). The closest I got was
xapply2 <- function(FUN,...,FLATTEN=TRUE,MoreArgs=NULL) {
L <- list(...)
xx <- do.call(expand.grid,L)
f <- function(...) {
do.call(FUN,lapply(list(...),"[[",1))
}
mlply(xx,f)
}
which still doesn't work. expand.grid is indeed more flexible than I thought (although it creates a weird data frame that can't be printed), but enough magic is happening inside mlply that I can't quite make it work.
Here is a test case:
L1 <- list(data.frame(x=1:10,y=1:10),
data.frame(x=runif(10),y=runif(10)),
data.frame(x=rnorm(10),y=rnorm(10)))
L2 <- list(y~1,y~x,y~poly(x,2))
z <- xapply(lm,L2,L1)
xapply(lm,L2,L1)
#ben-bolker, I had a similar desire and think I have a preliminary solution worked out, that I've also tested to work in parallel. The function, which I somewhat confusingly called gmcmapply (g for grid) takes an arbitrarily large named list mvars (that gets expand.grid-ed within the function) and a FUN that utilizes the list names as if they were arguments to the function itself (gmcmapply will update the formals of FUN so that by the time FUN is passed to mcmapply it's arguments reflect the variables that the user would like to iterate over (which would be layers in a nested for loop)). mcmapply then dynamically updates the values of these formals as it cycles over the expanded set of variables in mvars.
I've posted the preliminary code as a gist (reprinted with an example below) and would be curious to get your feedback on it. I'm a grad student, that is self-described as an intermediately-skilled R enthusiast, so this is pushing my R skills for sure. You or other folks in the community may have suggestions that would improve on what I have. I do think even as it stands, I'll be coming to this function quite a bit in the future.
gmcmapply <- function(mvars, FUN, SIMPLIFY = FALSE, mc.cores = 1, ...){
require(parallel)
FUN <- match.fun(FUN)
funArgs <- formals(FUN)[which(names(formals(FUN)) != "...")] # allow for default args to carry over from FUN.
expand.dots <- list(...) # allows for expanded dot args to be passed as formal args to the user specified function
# Implement non-default arg substitutions passed through dots.
if(any(names(funArgs) %in% names(expand.dots))){
dot_overwrite <- names(funArgs[which(names(funArgs) %in% names(expand.dots))])
funArgs[dot_overwrite] <- expand.dots[dot_overwrite]
#for arg naming and matching below.
expand.dots[dot_overwrite] <- NULL
}
## build grid of mvars to loop over, this ensures that each combination of various inputs is evaluated (equivalent to creating a structure of nested for loops)
grid <- expand.grid(mvars,KEEP.OUT.ATTRS = FALSE, stringsAsFactors = FALSE)
# specify formals of the function to be evaluated by merging the grid to mapply over with expanded dot args
argdefs <- rep(list(bquote()), ncol(grid) + length(expand.dots) + length(funArgs) + 1)
names(argdefs) <- c(colnames(grid), names(funArgs), names(expand.dots), "...")
argdefs[which(names(argdefs) %in% names(funArgs))] <- funArgs # replace with proper dot arg inputs.
argdefs[which(names(argdefs) %in% names(expand.dots))] <- expand.dots # replace with proper dot arg inputs.
formals(FUN) <- argdefs
if(SIMPLIFY) {
#standard mapply
do.call(mcmapply, c(FUN, c(unname(grid), mc.cores = mc.cores))) # mc.cores = 1 == mapply
} else{
#standard Map
do.call(mcmapply, c(FUN, c(unname(grid), SIMPLIFY = FALSE, mc.cores = mc.cores)))
}
}
example code below:
# Example 1:
# just make sure variables used in your function appear as the names of mvars
myfunc <- function(...){
return_me <- paste(l3, l1^2 + l2, sep = "_")
return(return_me)
}
mvars <- list(l1 = 1:10,
l2 = 1:5,
l3 = letters[1:3])
### list output (mapply)
lreturns <- gmcmapply(mvars, myfunc)
### concatenated output (Map)
lreturns <- gmcmapply(mvars, myfunc, SIMPLIFY = TRUE)
## N.B. This is equivalent to running:
lreturns <- c()
for(l1 in 1:10){
for(l2 in 1:5){
for(l3 in letters[1:3]){
lreturns <- c(lreturns,myfunc(l1,l2,l3))
}
}
}
### concatenated outout run on 2 cores.
lreturns <- gmcmapply(mvars, myfunc, SIMPLIFY = TRUE, mc.cores = 2)
Example 2. Pass non-default args to FUN.
## Since the apply functions dont accept full calls as inputs (calls are internal), user can pass arguments to FUN through dots, which can overwrite a default option for FUN.
# e.g. apply(x,1,FUN) works and apply(x,1,FUN(arg_to_change= not_default)) does not, the correct way to specify non-default/additional args to FUN is:
# gmcmapply(mvars, FUN, arg_to_change = not_default)
## update myfunc to have a default argument
myfunc <- function(rep_letters = 3, ...){
return_me <- paste(rep(l3, rep_letters), l1^2 + l2, sep = "_")
return(return_me)
}
lreturns <- gmcmapply(mvars, myfunc, rep_letters = 1)
A bit of additional functionality I would like to add but am still trying to work out is
cleaning up the output to be a pretty nested list with the names of mvars (normally, I'd create multiple lists within a nested for loop and tag lower-level lists onto higher level lists all the way up until all layers of the gigantic nested loop were done). I think using some abstracted variant of the solution provided here will work, but I haven't figured out how to make the solution flexible to the number of columns in the expand.grid-ed data.frame.
I would like an option to log the outputs of the child processesthat get called in mcmapply in a user-specified directory. So you could look at .txt outputs from every combination of variables generated by expand.grid (i.e. if the user prints model summaries or status messages as a part of FUN as I often do). I think a feasible solution is to use the substitute() and body() functions, described here to edit FUN to open a sink() at the beginning of FUN and close it at the end if the user specifies a directory to write to. Right now, I just program it right into FUN itself, but later it would be nice to just pass gmcmapply an argument called something like log_children = "path_to_log_dir. and then editing the body of the function to (pseudocode) sink(file = file.path(log_children, paste0(paste(names(mvars), sep = "_"), ".txt")
Let me know what you think!
-Nate

Resources