Parallelize user-defined function using apply family in R - r

I have a script that takes too long to compute and I'm trying to paralellize its execution.
The script basically loops through each row of a data frame and perform some calculations as shown below:
my.df = data.frame(id=1:9,value=11:19)
sumPrevious <- function(df,df.id){
sum(df[df$id<=df.id,"value"])
}
for(i in 1:nrow(my.df)){
print(sumPrevious(my.df,my.df[i,"id"]))
}
I'm starting to learn to parallelize code in R, this is why I first want to understand how I could do this with an apply-like function (e.g. sapply,lapply,mapply).
I've tried multiple things but nothing worked so far:
mapply(sumPrevious,my.df,my.df$id) # Error in df$id : $ operator is invalid for atomic vectors

Using theparallel package in R you can use the mclapply() function. You will need to adjust your code a little bit to make it run in parallel.
library(parallel)
my.df = data.frame(id=1:9,value=11:19)
sumPrevious <- function(i,df){df.id = df$id[i]
sum(df[df$id<=df.id,"value"])
}
mclapply(X = 1:nrow(my.df),FUN = sumPrevious,my.df,mc.preschedule = T,mc.cores = no.of.cores)
This code will run the sumPrevious in parallel on no.of.cores in your machine.

Well, this is fun playing with. you kind need something like below:
mapply(sumPrevious,list(my.df),my.df$id)
For supply, since the first input is the dataframe, you will have to define a given function for it to be ale to recognize it so:
sapply(my.df$id,function(x,y) sumPrevious(y,x),my.df)
I prefer mapply here since we can set the first value to be imputed as the dataframe directly. But the whole of the dataframe. That's why you have to use the function list.
Map ia a wrapper of mapply and thus would just present the solution in a list format. try it. Also lapply is similar to sapply only that sapply would have to simplify the results into an array format while lapply would give the same results as a list.
Though it seems whatever you are trying to do can simply be done by a cumsum function.
cumsum(df$values)

Related

not error, but not results either in R

I am trying to make a function in R that calculates the mean of nitrate, sulfate and ID. My original dataframe have 4 columns (date,nitrate, sulfulfate,ID). So I designed the next code
prueba<-read.csv("C:/Users/User/Desktop/coursera/001.csv",header=T)
columnmean<-function(y, removeNA=TRUE){ #y will be a matrix
whichnumeric<-sapply(y, is.numeric)#which columns are numeric
onlynumeric<-y[ , whichnumeric] #selecting just the numeric columns
nc<-ncol(onlynumeric) #lenght of onlynumeric
means<-numeric(nc)#empty vector for the means
for(i in 1:nc){
means[i]<-mean(onlynumeric[,i], na.rm = TRUE)
}
}
columnmean(prueba)
When I run my data without using the function(), but I use row by row with my data it will give me the mean values. Nevertheless if I try to use the function so it will make all the steps by itself, it wont mark me error but it also won't compute any value, as in my environment the dataframe 'prueba' and the columnmean function
what am I doing wrong?
A reproducible example would be nice (although not absolutely necessary in this case).
You need a final line return(means) at the end of your function. (Some old-school R users maintain that means alone is OK - R automatically returns the value of the last expression evaluated within the function whether return() is specified or not - but I feel that using return() explicitly is better practice.)
colMeans(y[sapply(y, is.numeric)], na.rm=TRUE)
is a slightly more compact way to achieve your goal (although there's nothing wrong with being a little more verbose if it makes your code easier for you to read and understand).
The result of an R function is the value of the last expression. Your last expression is:
for(i in 1:nc){
means[i]<-mean(onlynumeric[,i], na.rm = TRUE)
}
It may seem strange that the value of that expression is NULL, but that's the way it is with for-loops in R. The means vector does get changed sequentially, which means that BenBolker's advice to use return(.) is correct (as his advice almost always is.) . For-loops in R are a notable exception to the functional programming paradigm. They provide a mechanism for looping (as do the various *apply functions) but the commands inside the loop exert their effects in the calling environment via side effects (unlike the apply functions).

Looping in R to create transformed variables

I have a dataset of 80 variables, and I want to loop though a subset of 50 of them and construct returns. I have a list of the names of the variables for which I want to construct returns, and am attempting to use the dplyr command mutate to construct the variables in a loop. Specifically my code is:
for (i in returnvars) {
alldta <- mutate(alldta,paste("r",i,sep="") = (i - lag(i,1))/lag(i,1))}
where returnvars is my list, and alldta is my dataset. When I run this code outside the loop with just one of the `i' values, it works fine. The code for that looks like this:
alldta <- mutate(alldta,rVar = (Var- lag(Var,1))/lag(Var,1))
However, when I run it in the loop (e.g., attempting to do the previous line of code 50 times for 50 different variables), I get the following error:
Error: unexpected '=' in:
"for (i in returnvars) {
alldta <- mutate(alldta,paste("r",i,sep="") ="
I am unsure why this issue is coming up. I have looked into a number of ways to try and do this, and have attempted solutions that use lapply as well, without success.
Any help would be much appreciated! If there is an easy way to do this with one of the apply commands as well, that would be great. I did not provide a dataset because my question is not data specific, I'm simply trying to understand, as a relative R beginner, how to construct many transformed variables at once and add them to my data frame.
EDIT: As per Frank's comment, I updated the code to the following:
for (i in returnvars) {
varname <- paste("r",i,sep="")
alldta <- mutate(alldta,varname = (i - lag(i,1))/lag(i,1))}
This fixes the previous error, but I am still not referencing the variable correctly, so I get the error
Error in "Var" - lag("Var", 1) :
non-numeric argument to binary operator
Which I assume is because R sees my variable name Var as a string, rather than as a variable. How would I correctly reference the variable in my dataset alldta? I tried get(i) and alldta$get(i), both without success.
I'm also still open to (and actively curious about), more R-style ways to do this entire process, as opposed to using a loop.
Using mutate inside a loop might not be a good idea either. I am not sure if mutate makes a copy of the data frame but its generally not a good practice to grow a data frame inside a loop. Instead create a separate data frame with the output and then name the columns based on your logic.
result = do.call(rbind,lapply(returnvars,function(i) {...})
names(result) = paste("r",returnvars,sep="")
After playing around with this more, I discovered (thanks to Frank's suggestion), that the following works:
extended <- alldta # Make a copy of my dataset
for (i in returnvars) {
varname <- paste("r",i,sep="")
extended[[varname]] = (extended[[i]] - lag(extended[[i]],1))/lag(extended[[i]],1)}
This is still not very R-styled in that I am using a loop, but for a task that is only repeating about 50 times, this shouldn't be a large issue.

R: how to do multiple looping with sappy or lapply

I'm a beginner with R, and before then I used to use for loops
when doing macros.
However after learning R, I got to learn this interesting command sapply&
lapply but wondering how I can use this command for multiple looping.
For instance, when I was using for loop to perform simultaneous jobs,
I nested for loop in a for loop such as an example below:
for i in ~~~{
for j in ~~~~~
}
}
After learning sapply & lapply, I found out myself repeating same commands over and over since I don't know how to do multiple looping with these commands.
For example, below is the code for splitting file directory strings and return
7th and 8th chunk into the vector.
dir3<-sapply(strsplit(as.character(dir2),split="/",fixed=TRUE),function(x) (x[7]))
dir4<-as.list(dir3)
code<-do.call(rbind, dir4)
colnames(code)<-c("code")
dir5<-sapply(strsplit(as.character(dir2),split="/",fixed=TRUE),function(x) (x[8]))
dir6<-as.list(dir5)
fyear<-do.call(rbind, dir6)
colnames(fyear)<-c("fyear")
Is there any way I can perform the same task(=2nd looping) without copying the same command lines?
Thanks :)
You can just *apply the output of an *apply command.
I.e.:
sapply(
sapply( 1:10, function(x) x^2 ),
function(x) x^3
)
Obviously there's a better way to do the above example, but you get the point.
Another way would be to use mapply with expand.grid.

Dynamic input in R apply

I was wondering if the apply family could be used in R with a regressive input.
Say I have:
apply(MyMatrix,1,MyFunc,MyMatrix)
I know that apply is essentially a loop, so in the above example could it run one iteration of MyFunc over the first line of MyMatrix modifying MyMatrix globally and then select the modified MyMatrix for the next iteration ? I realize that normal loops could be used here but I just wanted to know if there is a way to do it like this.
Thanks
I don't believe so. Even modifying MyMatrix globally won't change the MyMatrix passed to your function. R functions don't operate that way. Your object is actually copied when it's passed into a function and a new instance of it exists then. It's not done by reference.
Unfortunately, the *apply family of functions are able to work in this manner. (This has been a frustration to me at times as well, but I've come to appreciate and work with it.)
There are two impediments to this:
The *apply family of functions deal with the value of MyMatrix when you make the call, iterate over the rows (in this example), and then join the results (based on the dimensions of each output). It is not re-evaluated each time.
Even if it did re-evaluate it, MyFunc is only given one row (in this example) at a time, not the whole matrix. (Your second reference to MyMatrix appears to be working around this.)
To do what I think you're saying, then your MyFunc function needs to accept as arguments the entire matrix and the row on which you are operating, and return just the row in question, ala:
MyFunc <- function(rownum, mtx) {
# ...
mtx[rownum,]
}
Using that premise, you could do:
for (rr in seq.int(nrow(MyMatrix))) {
MyMatrix[rr,] <- MyFunc(rr, MyMatrix)
}
or, if you must stay with the *apply family:
MyMatrix.new <- sapply(seq.int(nrow(MyMatrix)), MyFunc, MyMatrix)
You might want the transpose (t()) of the return from sapply() here.
If MyFunc returns the whole matrix instead of just one row, this can be done though a little differently.
I know of no way to directly do what you suggest.

returning different data frames in a function - R

Is it possible to return 4 different data frames from one function?
Scenario:
I am trying to read a file, parse it, and return some parts of the file.
My function looks something like this:
parseFile <- function(file){
carFile <- read.table(file, header=TRUE, sep="\t")
carNames <- carFile[1,]
carYear <- colnames(carFile)
return(list(carFile,carNames,carYear))
}
I don't want to have to use list(carFile,carNames,carYear). Is there a way return the 3 data frames without returning them in a list first?
R does not support multiple return values. You want to do something like:
foo = function(x,y){return(x+y,x-y)}
plus,minus = foo(10,4)
yeah? Well, you can't. You get an error that R cannot return multiple values.
You've already found the solution - put them in a list and then get the data frames from the list. This is efficient - there is no conversion or copying of the data frames from one block of memory to another.
This is also logical, the return from a function should conceptually be a single entity with some meaning that is transferred to whatever function is calling it. This meaning is also better conveyed if you name the returned values of the list.
You could use a technique to create multiple objects in the calling environment, but when you do that, kittens die.
Note in your example carYear isn't a data frame - its a character vector of column names.
There are other ways you could do that, if you really really want, in R.
assign('carFile',carFile,envir=parent.frame())
If you use that, then carFile will be created in the calling environment. As Spacedman indicated you can only return one thing from your function and the clean solution is to go for the list.
In addition, my personal opinion is that if you find yourself in such a situation, where you feel like you need to return multiple dataframes with one function, or do something that no one has ever done before, you should really revisit your approach. In most cases you could find a cleaner solution with an additional function perhaps, or with the recommended (i.e. list).
In other words the
envir=parent.frame()
will do the job, but as SpacedMan mentioned
when you do that, kittens die
The zeallot package does what you need in a similar that Python can unpack variables from a function. Reproducible example below.
parseFile <- function(){
carMPG <- mtcars$mpg
carName <- rownames(mtcars)
carCYL <- mtcars$cyl
return(list(carMPG,carName,carCYL))
}
library(zeallot)
c(myFile, myName, myYear) %<-% parseFile()

Resources