How can I use lapply() to "loop" over a multi-column dataset and apply a function? Normally I would use rollapply(), but for reasons that aren't worth going into, the analytics in this case only work with lapply(). I know how to run a function over an expanding window, but how can lapply() be used with a sliding window? For example, here's a toy example that manually shifts the range, using a function I'll call my_fun on a multi-column dataset (dat1):
set.seed(78)
dat1 <- as.data.frame(matrix(rnorm(1000), ncol = 20, nrow = 50))
my_fun <- function(x) {
  a <- apply(x, 1, mean)
}
test.1 <- my_fun(dat1[1:10])
test.2 <- my_fun(dat1[2:11])
test.3 <- my_fun(dat1[3:12])
Using lapply() for an expanding window works too, i.e., for ranges 1:10, 1:11, 1:12:
test.a <- lapply(seq(10, 12), function(x) my_fun(dat1[1:x]))
My question: is there any way to use lapply to replicate the sliding window analysis via the 3 manual examples above? I've tried several possibilities, using rep() and replicate(), for example, but so far no success. Any insight would be greatly appreciated.
Yes: slide the start of the window instead of the endpoint:
test.a <- lapply(seq(1, 3), function(x) my_fun(dat1[x:(x + 9)]))
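As a quick sanity check (reusing test.1, test.2 and test.3 from above), each list element should reproduce the corresponding manual result:
all.equal(test.a[[1]], test.1)  # TRUE
all.equal(test.a[[2]], test.2)  # TRUE
all.equal(test.a[[3]], test.3)  # TRUE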
In fact, it can be done with rollapply like this:
library(zoo)
res <- t(rollapply(t(dat1), 10, function(x) my_fun(t(x)), by.column = FALSE))
# verify that res[, i] equals test.i for i = 1,2,3
all.equal(res[, 1], test.1)
## [1] TRUE
all.equal(res[, 2], test.2)
## [1] TRUE
all.equal(res[, 3], test.3)
## [1] TRUE
I know this is a bonehead newbie question, but I've been trying to figure it out for quite a while and need some input. Basically, I'm trying to learn how to use the apply family to avoid for loops, specifically how to set up the call so that columns of a matrix serve as arguments to the function. I'll use a simple call to the rbinom function as an example.
Example: this for loop works fine. The data are a set of integers and a set of probabilities
success <- rep(-1, times=10) # initialize result var
num <- sample.int(20, 10) # get 10 random integers
p <- runif(10) # get 10 random probabilities
for (i in 1:10) {
  success[i] <- rbinom(n = 1, size = num[i], prob = p[i]) # number of successes in 1 trial
}
But how do I do the same thing with the apply family? I first put the data into 2 columns of a matrix, thinking that was the right start. However, the following does NOT work, obviously due to my poor understanding of how to set up a call to apply.
myData <- matrix(nrow=10, ncol=2)
myData[,1] <- num
myData[,2] <- p
success <- apply(myData, rbinom, n=1, size=myData[,1], prob=myData[,2])
Any tips are greatly appreciated! I'm coming to R from Fortran, and trying to port over a lot of code that is loaded with DO loops, so I really need to get my head around this.
lapply, sapply, and apply only iterate over one vector/list at a time. That is, apply will only call its function on one row or column at a time; it cannot pass several columns as separate arguments. What you need here is mapply or Map.
myData <- matrix(nrow=10, ncol=2)
myData[,1] <- num
myData[,2] <- p
mapply(rbinom, n = 1, myData[,1], myData[,2])
# [1] 5 4 11 8 3 3 17 8 0 11
Just like lapply returns a list, so does Map; similarly, just like sapply, mapply will return a vector or array if all return values are compatible, otherwise it returns a list as well.
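For instance, reusing num and p from the question (a small sketch; the random draws will differ from run to run):
Map(rbinom, n = 1, size = num, prob = p)     # always a list of length 10
mapply(rbinom, n = 1, size = num, prob = p)  # simplified to a vector of length 10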
These calls are equivalent:
sapply(1:3, function(z) z + 1)
mapply(function(z) z + 1, 1:3)
but mapply and Map allow an arbitrary number of lists/vectors, so for instance
func <- function(X,Y,Z) X^2+2*Y-Z
Map(func, 1:9, 11:19, 21:29)
## effectively the same as
list(
func(1, 11, 21),
func(2, 12, 22),
func(3, 13, 23),
...,
func(9, 19, 29)
)
The equivalent sapply call for your data would be
sapply(seq_len(nrow(myData)), function(ind) {
rbinom(n = 1, size = myData[ind,1], prob = myData[ind,2])
})
though I personally feel that mapply is easier to read.
I am a beginner with R programming. Recently I wrote a user-defined function as follows:
foo <- function(x){
  power <- 1:4
  sum(x^power)
}
This function works fine when x is a single number. For example, when x = 1 the result is 4, and when x = 10 the result is 11110. However, the function doesn't work with vectors: when x <- c(1, 10), the result is 10102, which is not what I want. My desired result is the vector 4 11110. I know this problem can be solved by using sapply() on the function or by adding a for loop inside the function, but I think there might be another way to rewrite the function without loops or "apply" functions. I have tried different ways to rewrite it but nothing works; can somebody help me solve the problem? Thanks!
Mathematically, a simpler and more direct approach is to rewrite foo using the closed form of the geometric series, x + x^2 + x^3 + x^4 = x*(x^4 - 1)/(x - 1) for x != 1 (and simply 4 when x = 1), like below
foo <- function(x) {
  power <- 1:4
  ifelse(x == 1, max(power), x * (x^max(power) - 1) / (x - 1))
}
which gives
> foo(c(1,10))
[1] 4 11110
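As a quick check of the closed form: for x = 10 the brute-force sum is 10 + 100 + 1000 + 10000 = 11110, and the formula reproduces it:
> 10 * (10^4 - 1) / (10 - 1)
[1] 11110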
I don't think there is a way to avoid some kind of implicit or explicit loop, since power is a vector and x is another vector of arbitrary length.
Here are few options :
Your best bet is sapply (which you have already figured out).
sapply(c(1, 10), foo)
#[1] 4 11110
Another way is to use Vectorize, where you cannot "see" the loop, but it still loops underneath since it is a wrapper around mapply.
Vectorize(foo)(c(1, 10))
#[1] 4 11110
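Because Vectorize is just that wrapper, the call above is effectively the same as:
mapply(foo, c(1, 10))
#[1] 4 11110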
Using outer :
foo <- function(x){
  power <- 1:4
  rowSums(outer(x, power, `^`))
}
foo(c(1, 10))
#[1] 4 11110
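To see what outer builds here: with x = c(1, 10) it creates the full matrix of every element of x raised to every power, and rowSums then collapses each row:
outer(c(1, 10), 1:4, `^`)
#      [,1] [,2] [,3]  [,4]
# [1,]    1    1    1     1
# [2,]   10  100 1000 10000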
and obviously you can write a simple for loop as well and pass c(1, 10) to it.
This works:
foo <- function(x, power = 1:4){
  ind <- 1 + seq_along(power)                 # columns of m that will hold the powers
  power <- matrix(rep(power, length(x)), nrow = length(x), byrow = TRUE)
  x <- as.matrix(x)
  m <- cbind(x, power)                        # column 1 is x, the rest are the powers
  m <- m[, 1]^m[, ind]                        # x_i ^ power_j for every combination
  v <- rowSums(m)
  return(v)
}
foo(x = c(1, 10))
## [1] 4 11110
Runs about 8.5x faster than using sapply(x, foo) (when x is a vector of length 1,000,000). It's a bit late here, so I don't know whether you could optimise the internals a little further.
I've got an interesting problem and have no idea where to begin -- in fact, I wasn't even sure how to title the question! What I want to do is apply functions to elements of a dataframe and use these to make new rows in a new dataframe. For example, suppose we have a dataframe df1 that gives some X and Y data for various States:
df1 <- data.frame(State=c("AL","AK"), X=c(1,3), y=c(2,4))
What I would like to do is start with the first state, AL, and make a new dataframe df2 with 3 rows, where the new values of df2$X are calculated using 3 different functions to give, for example: df1$X, df1$X - 1, and df1$X + 1. Likewise, I want to do a similar thing for the new values of df2$y, which in this example are calculated as df1$y, df1$y * 0.5, and df1$y * 0.5.
Then, I would proceed to the next State. The end result should be:
df2 <- data.frame(State = c("AL", "AL", "AL", "AK", "AK", "AK"),
                  X = c(1, 0, 2, 3, 2, 4), y = c(2, 1, 1, 4, 2, 2))
Does anyone know how I might approach this? I have no idea where to even begin... I can imagine some kind of for loop, but I'm hoping there's a more elegant approach in R.
base R solution:
funcs.X <- list(function(x) x, function(x) x-1, function(x) x+1)
funcs.y <- list(function(y) y, function(y) y*0.5, function(y) y*0.5)
# apply every function to x, then interleave so that all results for x[1] come first, then x[2], ...
apply.funcs <- function(funcs, x) as.vector(t(sapply(funcs, function(f) f(x))))
d <- data.frame(State = rep(df1$State, each = length(funcs.X)),
                X = apply.funcs(funcs.X, df1$X),
                y = apply.funcs(funcs.y, df1$y))
identical(d,df2)
# [1] TRUE
You could try
library(data.table)
res <- setDT(df1)[, list(X = c(X, X - 1, X + 1), y = c(y, y * 0.5, y * 0.5)), State]
all.equal(setDF(res), df2, check.attributes=FALSE)
#[1] TRUE
Apologies for the poor question title. Not too sure how to describe the problem here.
First, I have the code below.
# Data
set.seed(100)
x = matrix(runif(10000,0,1),100,100)
grpA = round(runif(100,1,5),0) # Group 1, 2, 3, 4, 5
# function
funA <- function(y, A){
  X <- lm(y ~ A)
  return(X$residuals)
}
# Calculation
A = apply(x,1,function(y) funA(y,grpA))
Now, instead of grpA, I have grpB below, where the groups are different for every column. Besides looping over each column, can I still use apply to calculate this? If so, how?
My actual funA calculation is a lot more complex, and I do need to call funA many times, so I am trying to avoid using a for loop. Thanks.
grpB = matrix(round(runif(10000,1,5),0),100,100)
First off, if your function funA does a lot of work, then using a for loop versus apply won't affect performance that much. This is because the only difference is in the overhead of looping, and most of the work is going to take place inside of funA in either case.
In fact, even if funA is simple, for and apply won't be that different performance-wise. Either way, there needs to be a loop inside of R with multiple R function calls. The real performance improvements by avoiding for loops come in situations where there is a builtin R function that performs the computation you need by looping in the underlying C code without the overhead of multiple function calls in R. Here is an illustrative example.
x <- matrix(runif(10000, 0, 1), 100, 100)
require(microbenchmark)
f1 <- function(z){
  ret <- rep(0, ncol(z))
  for(i in 1:ncol(z)){
    ret[i] <- sum(z[, i])
  }
  ret
}
f2 <- function(z){
  apply(z, 2, sum)
}
identical(f1(x),f2(x))
# [1] TRUE
identical(f1(x),colSums(x))
# [1] TRUE
microbenchmark(f1(x), f2(x), colSums(x))
# Unit: microseconds
#        expr     min       lq   median       uq      max neval
#       f1(x) 559.934 581.4775 596.4645 622.1425  773.519   100
#       f2(x) 484.265 512.1570 526.5700 546.5010 1100.540   100
#  colSums(x)  23.844  25.7915  27.0675  28.7575   59.485   100
So, in your situation, I wouldn't worry about using a for loop. There are ways to avoid a loop, for example, something like
sapply(1:ncol(x),function(i) fun(x[,i],y[,i]))
But it won't be much faster than a for loop.
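For completeness, with the objects in this question that pattern would look something like the sketch below (pairing column i of x with column i of grpB; use rows instead if that is how your groups line up):
res <- sapply(seq_len(ncol(x)), function(i) funA(x[, i], grpB[, i]))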
Just as an answer to
can I still use the apply to calculate this? If so, how?
The answer is yes. You can combine x and grpB into an array and then use apply on the resulting array.
# Data
set.seed(100)
x = matrix(runif(10000,0,1),100,100)
grpA = round(runif(100,1,5),0) # Group 1, 2, 3, 4, 5
# function
funA <- function(y, A){
  X <- lm(y ~ A)
  return(X$residuals)
}
# Original calculation
A <- apply(x, 1, funA, grpA)
# the array in this case
arr <- array(c(x, matrix(rep(grpA, 100), nrow=100, byrow=TRUE)), dim=c(nrow(x), ncol(x), 2))
# the new calculation: for each row index i, apply passes the slice arr[i, , ],
# whose columns are x[i, ] and the matching group vector
res <- apply(arr, 1, function(y) funA(y[, 1], y[, 2]))
# comparing results
all.equal(A, res)
## TRUE
#
# and for the new grpB
grpB = matrix(round(runif(10000,1,5),0),100,100)
# the array
arr <- array(c(x, grpB), dim=c(nrow(x), ncol(x), 2))
# the calculation (same as above)
res <- apply(arr, 1, function(y) funA(y[, 1], y[, 2]))
See #mrip's answer for the reasons this may not be a good idea.
You could easily use a sequence along the columns as an "indicator" or "extracting" variable and use vapply instead of apply; the third argument below tells vapply that each call must return a numeric vector of length nrow(x):
vapply(sequence(ncol(x)),
       function(z) funA(x[, z], grpB[, z]),
       numeric(nrow(x)))
I really like using the data frame $ syntax in R. However, if I try to do this with apply, it gives me an error that the input is a vector, not a data frame (which is correct). Is there a function similar to mapply that will let me keep using the data frame syntax?
df = data.frame(x = 1:5, y = 1:5)
# This works, but is hard to read because you have to remember what's
# in column 1
apply(df, 1, function(row) row[1])
# I'd rather do this, but it gives me an error
apply(df, 1, function(row) row$x)
You can't use $ on an atomic vector, but I guess you want to use it for readability. You can, however, use the [ subsetter.
Here is an example. Please provide a reproducible example next time; R questions make little sense without data.
set.seed(1234)
gidd <- data.frame(region = sample(letters[1:6], 100, rep = TRUE),
                   wbregion = sample(letters[1:6], 100, rep = TRUE),
                   foodshare = rnorm(100, 0, 1),
                   consincPPP05 = runif(100, 0, 5),
                   stringsAsFactors = FALSE)
apply(gidd,    ## applying over every row of gidd here
      1,
      function(row) {
        similarRows <- gidd[gidd$wbregion == row['region'] &
                              gidd$consincPPP05 > .8 * as.numeric(row['consincPPP05']), ]
        return(mean(similarRows$foodshare))
      })
Note that with apply the row is passed as a character vector (the data frame has mixed character and numeric columns), so I need to convert back to numeric.
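For the df in the original question the same pattern is much shorter (a minimal sketch; here every column is numeric, so no conversion is needed):
df <- data.frame(x = 1:5, y = 1:5)
apply(df, 1, function(row) row['x'] * 2)
# 2 4 6 8 10 (each element named "x")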
You can also use plyr or data.table for a cleaner syntax. For example,
apply(df, 1, function(row) row[1] * 2)
is roughly equivalent to plyr's
library(plyr)
adply(df, 1, summarise, z = x * 2)
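A data.table sketch of the same row-wise computation (assuming df as defined in the question) also lets you refer to columns by name:
library(data.table)
dt <- as.data.table(df)
dt[, z := x * 2]   # add z by reference, referring to x by name
dt[, x * 2]        # or just compute and return the vector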