R: how to do multiple looping with sapply or lapply

I'm a beginner with R; before this I used for loops when writing macros.
After learning R, I came across the interesting functions sapply and lapply, but I'm wondering how to use them for multiple (nested) looping.
For instance, when I used for loops to perform several jobs at once, I nested one for loop inside another, as in the example below:
for (i in ~~~) {
for (j in ~~~~~) {
}
}
After learning sapply and lapply, I found myself repeating the same commands over and over, since I don't know how to do multiple looping with them.
For example, below is the code for splitting file directory strings and returning the 7th and 8th chunks as vectors.
dir3<-sapply(strsplit(as.character(dir2),split="/",fixed=TRUE),function(x) (x[7]))
dir4<-as.list(dir3)
code<-do.call(rbind, dir4)
colnames(code)<-c("code")
dir5<-sapply(strsplit(as.character(dir2),split="/",fixed=TRUE),function(x) (x[8]))
dir6<-as.list(dir5)
fyear<-do.call(rbind, dir6)
colnames(fyear)<-c("fyear")
Is there any way I can perform the same task (i.e. the second level of looping) without copying the same command lines?
Thanks :)

You can just *apply the output of an *apply command.
I.e.:
sapply(
sapply( 1:10, function(x) x^2 ),
function(x) x^3
)
Obviously there's a better way to do the above example, but you get the point.
Another way would be to use mapply with expand.grid.
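As a rough sketch of both suggestions applied to the question: the outer sapply below loops over the chunk positions (7 and 8) and the inner one over the split paths, so the command is written only once. The paths in dir2 are made up for illustration, and the names parts, chunks, grid and prod_ij are not from the original post.

```r
# Hypothetical stand-in for the asker's dir2 (the real paths are not shown)
dir2 <- c("/a/b/c/d/e/C001/2010/file.txt",
          "/a/b/c/d/e/C002/2011/file.txt")

# Split every path once
parts <- strsplit(as.character(dir2), split = "/", fixed = TRUE)

# Outer loop: chunk positions; inner loop: paths.
# The result is a character matrix with one column per position.
chunks <- sapply(c(code = 7, fyear = 8),
                 function(k) sapply(parts, function(x) x[k]))

# mapply + expand.grid: one call per (i, j) combination,
# the counterpart of a for loop nested in a for loop
grid <- expand.grid(i = 1:3, j = 1:2)
prod_ij <- mapply(function(i, j) i * j, grid$i, grid$j)
```

Here chunks[, "code"] and chunks[, "fyear"] play the role of the separate code and fyear objects in the question.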

Related

How to loop through columns of a data.frame and use a function

This has probably been answered already and in that case, I am sorry to repeat the question, but unfortunately, I couldn't find an answer to my problem. I am currently trying to work on the readability of my code and trying to use functions more frequently, yet I am not that familiar with it.
I have a data.frame and some columns contain NA's that I want to interpolate with, in this case, a simple kalman filter.
require(imputeTS)
#some test data
col <- c("Temp","Prec")
df_a <- data.frame(c(10,13,NA,14,17),
c(20,NA,30,NA,NA))
names(df_a) <- col
#this is my function I'd like to use
gapfilling <- function(df,col){
print(sum(is.na(df[,col])))
df[,col] <- na_kalman(df[,col])
}
#this is my for-loop to loop through the columns
for (i in col) {
gapfilling(df_a, i)
}
I have two problems:
My for loop works, yet it doesn't overwrite the data.frame. Why?
How can I achieve this without a for-loop? As far as I am aware you should avoid for-loops if possible and I am sure it's possible in my case, I just don't know how.
How can I achieve this without a for-loop? As far as I am aware you should avoid for-loops if possible and I am sure it's possible in my case, I just don't know how.
You most definitely do not have to avoid for loops. What you should avoid is using a loop to perform actions that could be vectorized. Loops in R are generally fine; they are (much) slower than loops in compiled languages such as C++, but comparable to loops in interpreted languages such as Python.
My for loop works, yet it doesn't overwrite the data.frame. Why?
This is a problem with overwriting values within a function, or what is referred to as scope. Basically any assignment is restricted to its current environment (or scope). Take the example below:
f <- function(x){
a <- x
cat("a is equal to ", a, "\n")
return(3)
}
x <- 4
f(x)
a is equal to 4
[1] 3
print(a)
Error in print(a) : object 'a' not found
As you can see, "a" definitely exists while the function runs, but it stops existing once the function call has finished. It is restricted to the environment (or scope) of the function, which here lasts only as long as the function is running.
To get around this, you have to overwrite the value in the global environment:
for (i in col) {
df_a[, i] <- gapfilling(df_a, i)
}
Now for readability (not speed) one could change this to an lapply call:
df_a[, col] <- lapply(df_a[, col], na_kalman)
I want to stress that this is not faster than using a loop. lapply iterates over each column, just as a loop would. A speed-up would only be possible if, say, na_kalman were written to take multiple columns at once, possibly saving time through optimized C or C++ code.
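To make the scoping point concrete, here is a minimal sketch in which the function returns the modified data frame and the caller assigns it back. So that it runs without imputeTS, it uses fill_mean, a hypothetical mean-imputation helper standing in for na_kalman; the filler argument is likewise an assumption for illustration.

```r
# Hypothetical stand-in for imputeTS::na_kalman:
# replaces NAs with the mean of the observed values
fill_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Works on a local copy of df and returns it explicitly;
# nothing outside the function changes until the caller assigns the result
gapfilling <- function(df, cols, filler = fill_mean) {
  df[cols] <- lapply(df[cols], filler)
  df
}

df_a <- data.frame(Temp = c(10, 13, NA, 14, 17),
                   Prec = c(20, NA, 30, NA, NA))

# Assigning the return value is what actually overwrites df_a
df_a <- gapfilling(df_a, c("Temp", "Prec"))
```

With imputeTS installed, passing filler = na_kalman should give the interpolation asked for in the question.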

Parallelize user-defined function using apply family in R

I have a script that takes too long to compute and I'm trying to parallelize its execution.
The script basically loops through each row of a data frame and performs some calculations as shown below:
my.df = data.frame(id=1:9,value=11:19)
sumPrevious <- function(df,df.id){
sum(df[df$id<=df.id,"value"])
}
for(i in 1:nrow(my.df)){
print(sumPrevious(my.df,my.df[i,"id"]))
}
I'm starting to learn to parallelize code in R, this is why I first want to understand how I could do this with an apply-like function (e.g. sapply,lapply,mapply).
I've tried multiple things but nothing worked so far:
mapply(sumPrevious,my.df,my.df$id) # Error in df$id : $ operator is invalid for atomic vectors
Using the parallel package in R, you can use the mclapply() function. You will need to adjust your code a little to make it run in parallel.
library(parallel)
my.df = data.frame(id=1:9,value=11:19)
# number of worker processes; detectCores() is one possible choice
no.of.cores = detectCores()
sumPrevious <- function(i,df){
df.id = df$id[i]
sum(df[df$id<=df.id,"value"])
}
mclapply(X = 1:nrow(my.df),FUN = sumPrevious,my.df,mc.preschedule = TRUE,mc.cores = no.of.cores)
This code runs sumPrevious in parallel on no.of.cores cores of your machine. Note that mclapply relies on forking, so on Windows it only works with mc.cores = 1.
Well, this is fun to play with. You need something like the following:
mapply(sumPrevious,list(my.df),my.df$id)
For sapply, since the first input of sumPrevious is the data frame, you have to wrap it in a function that swaps the arguments so that the data frame is recognized:
sapply(my.df$id,function(x,y) sumPrevious(y,x),my.df)
I prefer mapply here since we can pass the data frame directly as the first argument. But it has to be the whole data frame, not its individual columns, which is why you have to wrap it in list().
Map is a wrapper around mapply that returns the result as a list; try it. Likewise, lapply is similar to sapply, except that sapply tries to simplify the result into a vector or array while lapply always returns a list.
That said, it seems that what you are trying to do can be done simply with the cumsum function (this matches sumPrevious because the ids are in increasing order):
cumsum(my.df$value)
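A quick way to check that these variants agree on the example data is to compute each of them and compare. The MoreArgs form is an additional, commonly used way to pass a fixed argument to mapply; the via_* names are chosen here for illustration only.

```r
my.df <- data.frame(id = 1:9, value = 11:19)

sumPrevious <- function(df, df.id) {
  sum(df[df$id <= df.id, "value"])
}

# Data frame wrapped in a list so mapply recycles it as a single element
via_mapply <- mapply(sumPrevious, list(my.df), my.df$id)

# Same idea, passing the fixed argument through MoreArgs
via_moreargs <- mapply(sumPrevious, df.id = my.df$id,
                       MoreArgs = list(df = my.df))

# sapply with a wrapper that swaps the argument order
via_sapply <- sapply(my.df$id, function(x, y) sumPrevious(y, x), my.df)

# Vectorized equivalent (valid because the ids are sorted increasingly)
via_cumsum <- cumsum(my.df$value)
```

All four objects should hold the same nine running totals.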

R: mapply versus for loop: identical performance

I am trying to get the most optimized performance for a piece of code, but I am getting almost identical performance between mapply and for loops. Why is that? Would plyr or data.table be faster? Or is there a more efficient way to write my function?
The first chunk of code creates a list of 1000 lists, each nested list containing 1-10 random letters.
testlist<-list()
for(i in 1:1000){
testlist[i]<-list(paste(sample(c(letters),sample(1:10, 1))))}
The script that I am trying to optimize is one that counts the number of intersections across all possible combinations (1,000,000) of my nested lists. Below illustrates the mapply syntax I used for this.
#Function for mapply
intersectfunction<-function(x,y){
length(intersect(x,y))
}
#mapply syntax
T1<-Sys.time()
Intersects<-mapply(x=rep(testlist,length(testlist)),intersectfunction,y=rep(testlist,each=length(testlist)))
mapplytime<-Sys.time()-T1
Below illustrates a nested for loop syntax that produces essentially identical output.
T1<-Sys.time()
Intersects<-vector(length=length(testlist)^2)
for(i in 1:length(testlist)){
for(j in 1:length(testlist)){
Intersects[j+((i-1)*length(testlist))]<-length(intersect(testlist[[i]],testlist[[j]]))
}
}
forlooptime<-Sys.time()-T1
The weird thing is that each syntax takes almost the same amount of time, even though it seems like mapply should be more efficient. This suggests to me that I am either doing something wrong with mapply, or that mapply is not the right tool for accomplishing my goal.
> mapplytime
Time difference of 20.97202 secs
> forlooptime
Time difference of 23.29733 secs
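Similar timings are to be expected, since mapply still calls the R function once for every pair, much as the nested loop does. One possible vectorized alternative, sketched here under the assumption that no vector contains a repeated letter (which holds for sample() without replacement), is to encode each vector as a 0/1 membership row and obtain all pairwise intersection sizes with a single matrix product. The names member and counts are illustrative, and the list is kept small for the demonstration.

```r
set.seed(1)

# Smaller version of the test data: 50 vectors of 1-10 distinct letters
testlist <- lapply(1:50, function(i) sample(letters, sample(1:10, 1)))

# Membership matrix: one row per vector, one column per letter,
# 1 where the letter is present and 0 otherwise
member <- t(sapply(testlist, function(x) as.integer(letters %in% x)))

# member %*% t(member): entry [i, j] counts the letters shared by
# vectors i and j, i.e. length(intersect(testlist[[i]], testlist[[j]]))
counts <- tcrossprod(member)
```

Because counts is symmetric, as.vector(counts) lists the pairs in the same order as the Intersects vector built by the loop.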

Using a list of matrix names

I have 75 matrices that I want to search through. The matrices are named a1r1, a1r2, a1r3, a1r4, a1r5, a2r1,...a15r5, and I have a list with all 75 of those names in it; each matrix has the same number of rows and columns. Inside some nested for loops, I also have a line of code that, for the first matrix looks like this:
total <- (a1r1[row,i]) + (a1r1[row,j]) + (a1r1[row,k])
(i, j, k, and row are all variables that I am looping over.) I would like to automate this line so that the for loops would fully execute using the first matrix in the list, then fully execute using the second matrix and so on. How can I do this?
(I'm an experienced programmer, but new to R, so I'm willing to be told I shouldn't use a list of the matrix names, etc. I realize too that there's probably a better way in R than for loops, but I was hoping for sort of quick and dirty at my current level of R expertise.)
Thanks in advance for the help.
Here is the R way to do this:
lapply(ls(pattern='^a[0-9]+r[0-9]+$'),
function(nn) {
x <- get(nn)
sum(x[row,c(i,j,k)])
})
ls gives a character vector of the variable names matching a pattern (the + lets it match two-digit indices such as a10r1 as well)
You loop through the resulting vector using lapply
get turns each name into the variable it refers to
multi-indexing with the vectorized sum function adds up the three entries
It's not bad practice to automatically build lists of names designating your objects. You can build such lists with paste, rep, and sequences such as 0:10, etc. Once you have a vector of object names (let's call it mylist), the get function (or mget, for several names at once) gives the objects themselves.
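A small sketch of that idea, assuming three toy matrices in place of the real 75 and arbitrary values for row, i, j and k: the names are built with paste0 and rep, mget collects the objects into a named list, and sapply computes the total for each matrix.

```r
# Toy matrices standing in for a1r1 ... a15r5 (values are made up)
a1r1  <- matrix(1:12,  nrow = 3)
a1r2  <- matrix(13:24, nrow = 3)
a10r1 <- matrix(25:36, nrow = 3)

# All 75 names from the question can be generated rather than typed
all_names <- paste0("a", rep(1:15, each = 5), "r", rep(1:5, times = 15))

# For this sketch only three of the objects actually exist
mat_names <- c("a1r1", "a1r2", "a10r1")

# mget() looks up several names at once and returns a named list
mats <- mget(mat_names, envir = environment())

# Arbitrary example indices
row <- 2; i <- 1; j <- 2; k <- 4

# One total per matrix, named after the matrix it came from
totals <- sapply(mats, function(m) sum(m[row, c(i, j, k)]))
```

Keeping the matrices in a list from the start, instead of as 75 separate variables, would remove the need for get or mget altogether.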

How to properly loop without eval, parse, text=paste("... in R

So I had a friend help me with some R code and I feel bad asking because the code works but I have a hard time understanding and changing it and I have this feeling that it's not correct or proper code.
I am loading files into separate R dataframes, labeled x1, x2... xN etc.
I want to combine the dataframes and this is the code we got to work:
assign("x",eval(parse(text=paste("rbind(",paste("x",rep(1:length(toAppend)),sep="",collapse=", "),")",sep=""))))
"toAppend" is a list of the files that were loaded into the x1, x2 etc. dataframes.
Without all the text to code tricks it should be something like:
x <- rbind(##x1 through xN or some loop for 1:length(toAppend)#)
Why can't R take the code without the evaluate text trick? Is this good code? Will I get fired if I use this IRL? Do you know a proper way to write this out as a loop instead? Is there a way to do it without a loop? Once I combine these files/dataframes I have a data set over 30 million lines long which is very slow to work with using loops. It takes more than 24 hours to run my example line of code to get the 30M line data set from ~400 files.
If these dataframes all have the same structure, you will save considerable time by using the 'colClasses' argument to the read.table or read.csv steps. The lapply function can pass this to read.* functions and if you used Dason's guess at what you were really doing, it would be:
x <- do.call(rbind, lapply(file_names, read.csv,
colClasses=c("numeric", "Date", "character")
)) # whatever the ordered sequence of classes might be
The reason that rbind cannot take your character vector is that the names of objects are 'language' objects, and a character vector is just not a language type. Pushing character vectors through the semi-permeable membrane separating 'language' from 'data' in R requires using assign, do.call, eval(parse()), environments, Reference Classes, or perhaps other methods I have forgotten.
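If the data frames already exist as x1, x2, ... xN, one way to combine them without eval(parse()) is to look them up by name with mget and hand the resulting list to do.call. This is a sketch on three tiny made-up data frames; the contents of toAppend are hypothetical file names.

```r
# Tiny stand-ins for the loaded data frames
x1 <- data.frame(id = 1:2, v = c("a", "b"))
x2 <- data.frame(id = 3:4, v = c("c", "d"))
x3 <- data.frame(id = 5:6, v = c("e", "f"))

# Hypothetical list of the files that were loaded
toAppend <- c("file1.csv", "file2.csv", "file3.csv")

# Build the names "x1", "x2", ... and fetch the objects as a list
frames <- mget(paste0("x", seq_along(toAppend)), envir = environment())

# do.call passes every list element to rbind as a separate argument
x <- do.call(rbind, frames)
```

Reading the files straight into a list with lapply, as in the answer above, avoids creating the numbered variables in the first place.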
