Vectorizing R code, which already uses sequence notation

I am struggling to write fast code to compute a function of the following vector:

summm[t] = sum_{k=0}^{t-2} ( sum_{j=0}^{k} rho^j )^2,   for t = 2, ..., 100

Currently I compute it with a for loop, which is very slow:
rho <- 0.9
E_D <- numeric(100)
E_D[1] <- 1
for (t in 2:100){
  summm <- sum(cumsum(rho^(0:(t-2)))^2)  # inner double sum for this t
  E_D[t] <- t + exp(summm)
}
summm is the element of the vector I defined analytically above. E_D is a vector, which is some function of that vector. If I set the maximum t to 5000, the code above runs for more than 1 second on my machine, which is too slow for my purposes.
I tried a data.table solution, but it cannot accommodate an intermediate vector output within a cell:
tempdt <- data.table(prd = 2:100, summm = 0)
tempdt[, summm := sum(cumsum(rho^(0:(prd-2)))^2)]
Warning message:
In 0:(prd - 2) : numerical expression has 99 elements: only the first used
How can I make the code above faster? Please do not tell me that I have to do it in Matlab...
EDIT: To clarify, I need to compute the vector summm defined above:

summm[t] = sum_{k=0}^{t-2} ( sum_{j=0}^{k} rho^j )^2,   for t = 2, ..., 100
Maybe something like:
n <- 100L
cp <- cumprod(rep(0.9, n - 1L)) / 0.9   # 0.9^(0:(n-2))
cssq <- cumsum(cp)^2                    # squared partial sums of the geometric series
cumsum(cssq)                            # summm for t = 2..n
truncated output:
[1] 1.00000 4.61000 11.95410 23.78082 40.55067 62.50542 89.72283 122.15959 ...
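Plugging this back into E_D is fully vectorised as well (a sketch, assuming the definition from the question's loop, E_D[1] = 1 and E_D[t] = t + exp(summm)):
summm_vec <- cumsum(cssq)            # summm for t = 2..n
E_D <- c(1, (2:n) + exp(summm_vec))  # same values as the loop produces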

The : operator is not vectorized over its arguments, so we need to either loop with sapply/lapply or do a group-by-row in data.table:
library(data.table)
tempdt[, summm := sum(cumsum(rho^(0:(prd-2)))^2), by = seq_len(nrow(tempdt))]
head(tempdt)
# prd summm
#1: 2 1.00000
#2: 3 4.61000
#3: 4 11.95410
#4: 5 23.78082
#5: 6 40.55067
#6: 7 62.50542
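For completeness, a sketch of the sapply route mentioned above (assuming rho <- 0.9 as in the question), which gives the same values without data.table:
rho <- 0.9
summm <- sapply(2:100, function(t) sum(cumsum(rho^(0:(t - 2)))^2))
head(summm)
# [1]  1.00000  4.61000 11.95410 23.78082 40.55067 62.50542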

Related

Is there a way to obtain and store positions of matrix element in JAGS?

I am developing a Bayesian hierarchical model in R with BUGS code in JAGS.
In my model, I have two matrices that contain related information about each other at the exact same matrix positions. My information is structured by rows. I apply a mathematical operation to the first matrix, Distmat, by row:
diffmat[i,j] <- abs(Distmat[birthterr[i],j] - Dist[i])
I want to record the column position of every minimum value in each row of diffmat in a new vector, and then apply this vector to the second matrix. This would be relatively easy in regular R code using which() or which.min():
a <- numeric()
for (i in 1:dim(diffmat)[1])
  a[i] <- which.min(diffmat[i, ])
And then apply vector "a" to the second matrix (terrmat) to obtain the values associated with Distmat positions:
b <- numeric(0)
for (i in 1:dim(diffmat)[1])
  b[i] <- terrmat[i, a[i]]
However, BUGS code apparently recognizes neither which() nor which.min(), and I am struggling to find a way to store these matrix row positions in vectors. Perhaps there is a very simple solution to this, but I am really stuck. I hope my explanation is clear enough.
Any suggestions would be very appreciated. Thanks for your time!
Here's a minimal working example. The analog here is that x plays the role of your diffmat; I'm drawing it at random, but the approach still works if you define it otherwise. Essentially, you rank the values of x in each row and build a dummy matrix e coded 1 where x[i,j] has rank 1 and 0 otherwise. Taking the inner product of each row of e with a vector of values from 1:ncol(terrmat) (assuming terrmat and diffmat have the same dimensions) then gives you the column index of the first-ranked value for observation i. The ymat in the example below is where your terrmat would go. I think it will be pretty slow on any real-sized problem, but it appears to work from the output below.
dl <- list(
  ymat = matrix(1:3, ncol=3, nrow=5, byrow=TRUE),
  yinds = 1:3
)
mods <- "model{
for(i in 1:5){
for(j in 1:3){
x[i,j] ~ dnorm(0,1)
e[i,j] <- equals(rx[i,j], 1)
}
rx[i,1:3] <- rank(x[i,1:3])
ind[i] <- inprod(e[i,], yinds)
yval[i] <- ymat[i,ind[i]]
}
}"
library(runjags)
out <- run.jags(mods, data=dl, monitor="yval")
out
#
# JAGS model summary statistics from 20000 samples (chains = 2; adapt+burnin = 5000):
#
# Lower95 Median Upper95 Mean SD Mode MCerr MC%ofSD SSeff AC.10 psrf
# yval[1] 1 2 3 1.9973 0.8139 2 0.0058421 0.7 19409 -0.0077146 0.99996
# yval[2] 1 2 3 2.0067 0.81605 3 0.0057704 0.7 20000 0.00049096 1.0003
# yval[3] 1 2 3 1.9895 0.8142 2 0.0057573 0.7 20000 0.00066309 1
# yval[4] 1 2 3 1.9973 0.81638 1 0.0057727 0.7 20000 -0.00040016 0.99998
# yval[5] 1 2 3 1.993 0.81611 1 0.0057708 0.7 20000 -0.0027988 0.99996
#
# Total time taken: 0.7 seconds
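For intuition, here is a plain-R sketch of the same rank/equals/inprod trick (the random diffmat and the terrmat layout are illustrative assumptions); it can be handy for sanity-checking the logic before running JAGS:
set.seed(1)
diffmat <- matrix(rnorm(15), nrow = 5, ncol = 3)          # stand-in for your diffmat
terrmat <- matrix(1:3, nrow = 5, ncol = 3, byrow = TRUE)  # stand-in for your terrmat
e   <- t(apply(diffmat, 1, function(r) as.numeric(rank(r) == 1)))  # 1 at the row minimum
ind <- as.vector(e %*% seq_len(ncol(diffmat)))            # column index of the minimum
b   <- terrmat[cbind(seq_len(nrow(terrmat)), ind)]        # values picked from terrmat
all(ind == apply(diffmat, 1, which.min))                  # TRUE (no ties here)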

How to use with() function in R instead of apply()

I am trying to optimise code that I have written using apply() and similar functions (e.g. lapply()). Unfortunately I do not see much improvement, so while searching I came across the post apply() is slow - how to make it faster or what are my alternatives?, where one suggestion is to use with() instead of apply(), which is certainly much faster.
What I want to do is to apply a user defined function to every row of a matrix. This function takes as input the data from the row, makes some calculations and returns a vector with the results.
A toy example where I use the apply() function, the with() and a vectorized version:
#Generate a matrix 10x3
prbl1=matrix(runif(30),nrow=10)
prbl2=data.frame(prbl1)
prbl3=prbl2
#function for the apply()
fn1=function(row){
  x=row[1]
  y=row[2]
  z=row[3]
  k1=2*x+3*y+4*z
  k2=2*x*3*y*4*z
  k3=2*x*y+3*x*z
  return(c(k1,k2,k3))
}
#function for the with()
fn2=function(x,y,z){
  k1=2*x+3*y+4*z
  k2=2*x*3*y*4*z
  k3=2*x*y+3*x*z
  return(c(k1,k2,k3))
}
#Vectorise fn2
fn3=Vectorize(fn2)
#apply the functions:
rslt1=t(apply(prbl1,1,fn1))
rslt2=t(with(prbl2,fn2(X1,X2,X3)))
rslt2=cbind(rslt2[1:10],rslt2[11:20],rslt2[21:30])
rslt3=t(with(prbl3,fn3(X1,X2,X3)))
All three produce the same output, a 10x3 matrix, which is what I want. Notice, though, that for rslt2 I need to bind the results afterwards, since the output of with() is a vector of length 30. I suspected this is because the function is not vectorised (if I understood this correctly). In rslt3 I use a vectorised version of fn2, which generates the output in the expected shape.
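A quick check confirms the shape: with() calls fn2() once on whole columns, and c() flattens the three length-10 results into one length-30 vector.
length(with(prbl2, fn2(X1, X2, X3)))
# [1] 30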
When I compare the performance of the three, I get:
library(rbenchmark)
benchmark(rslt1=t(apply(prbl1,1,fn1)),
          rslt2=with(prbl2,fn2(X1,X2,X3)),
          rslt3=with(prbl3,fn3(X1,X2,X3)),
          replications=1000000)
test replications elapsed relative user.self sys.self user.child sys.child
1 rslt1 1000000 103.51 7.129 102.63 0.02 NA NA
2 rslt2 1000000 14.52 1.000 14.41 0.01 NA NA
3 rslt3 1000000 123.44 8.501 122.41 0.05 NA NA
where with() without vectorisation is definitely faster.
My question: Since rslt2 is the most efficient approach, is there a way that I can use this correctly without the need to bind the results afterwards? It does the job but I feel it is not efficient coding.
The first and third functions you give are applied one row at a time, so they are called 10 times in your example. The second function takes advantage of the fact that multiplication and addition in R are already vectorised, so no loop or apply-style function is necessary: the function is called only once. If you want to keep your current code, all you'd need to do is change the c to cbind in fn2.
fn2=function(x,y,z){
  k1=2*x+3*y+4*z
  k2=2*x*3*y*4*z
  k3=2*x*y+3*x*z
  return(cbind(k1,k2,k3))
}
All with() does is evaluate the expression it's given within the supplied list, data.frame, or environment. So with(prbl2, fn2(X1,X2,X3)) is entirely equivalent to fn2(prbl2$X1, prbl2$X2, prbl2$X3).
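A quick sketch of that equivalence (using the prbl2 data frame from the question):
identical(with(prbl2, fn2(X1, X2, X3)),
          fn2(prbl2$X1, prbl2$X2, prbl2$X3))
# [1] TRUE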
Is this your real function? If it is, then problem solved. If not, it depends on whether your real function consists entirely of operations and functions that are already vectorised or can be replaced with vectorised equivalents.
For the amended function per the comments:
Single row:
fn1 <- function(row){
  x <- row[1]
  y <- row[2]
  z <- row[3]
  k1 <- 2*x + 3*y + 4*z
  k2 <- 2*x * 3*y * 4*z
  k3 <- 2*x*y + 3*x*z
  if (k1 > 0 & k2 > 0 & k3 > 0){
    return(cbind(k1, k2, k3))
  } else {
    k1 <- 5*x + 3*y + 4*z
    k2 <- 5*x * 3*y * 4*z
    k3 <- 5*x*y + 3*x*z
    if (k1 < 0 || k2 < 0 || k3 < 0) {
      return(cbind(0, 0, 0))
    } else {
      return(cbind(k1, k2, k3))
    }
  }
}
Whole matrix:
fn2 <- function(mat) {
  x <- mat[, 1]
  y <- mat[, 2]
  z <- mat[, 3]
  k1 <- 2*x + 3*y + 4*z
  k2 <- 2*x * 3*y * 4*z
  k3 <- 2*x*y + 3*x*z
  l1 <- 5*x + 3*y + 4*z
  l2 <- 5*x * 3*y * 4*z
  l3 <- 5*x*y + 3*x*z
  out <- array(0, dim = dim(mat))
  useK <- k1 > 0 & k2 > 0 & k3 > 0
  useL <- !useK & l1 >= 0 & l2 >= 0 & l3 >= 0
  out[useK, ] <- cbind(k1, k2, k3)[useK, ]
  out[useL, ] <- cbind(l1, l2, l3)[useL, ]
  out
}
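A quick consistency check between the two versions (a sketch, assuming a random 10x3 input like the toy example):
set.seed(42)
mat <- matrix(runif(30, -1, 1), nrow = 10)
all.equal(t(apply(mat, 1, fn1)), fn2(mat), check.attributes = FALSE)
# [1] TRUE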

How to apply a function to each element of a vector in R

Let's say I want to multiply each even element of a vector by 2 and each odd element of a vector by 3. Here is some code that can do this:
v <- 0:10
idx <- v %% 2 == 0
v[idx] <- v[idx] * 2
v[!idx] <- v[!idx] * 3
This would get difficult if I had more than two cases. It seems like the apply family of functions never deals with vectors, so I don't know a better way to approach this. Maybe an apply function would work if I transformed the data, but it seems like that shouldn't be necessary for such a simple problem.
Any ideas?
Edit: Sorry for the confusion. I am not specifically interested in the %% operator; I wanted to put some concrete code in my question but, judging by the responses, was too specific. What I want to figure out is how to apply an arbitrary function to each element of a vector. This was not possible with apply(), and I thought sapply() only worked with lists.
You can do:
v <- v * c(2, 3)[v %% 2 + 1]
It is generalizable to any v %% n, e.g.:
v <- v * c(2, 3, 9, 1)[v %% 4 + 1]
Also it does not require that length(v) be a multiple of n.
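For example, on the question's v <- 0:10 (even values doubled, odd values tripled):
v <- 0:10
v * c(2, 3)[v %% 2 + 1]
# [1]  0  3  4  9  8 15 12 21 16 27 20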
You can use vector multiplication to do what you want:
tmp <- 1:10
tmp * rep(c(3,2), length(tmp)/2)
This is easy to extend to three or more cases (provided length(tmp) is a multiple of the cycle length):
tmp * rep(c(3,2,4), length(tmp)/3)
Easiest would be:
v*c(2,3) # as suggested by flodel in a comment.
The term to search for in the documentation is "argument recycling" ... a feature of the R language. It only works for dyadic infix functions (see ?Ops). For non-dyadic vectorized functions that would not error out with some of the arguments, and where you couldn't depend on the structure of v being quite so regular, you could use ifelse:
ifelse( (1:length(v)) %% 2 == 0, func1(v), func2(v) )
This constructs two vectors and then chooses elements from the first or second based on the truth value of the first argument. If you were trying to answer the question in the title of your posting, then you should look at:
?sapply
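For instance, a minimal sapply() sketch for the question in the title, applying an arbitrary function to each element:
sapply(0:10, function(x) if (x %% 2 == 0) x * 2 else x * 3)
# [1]  0  3  4  9  8 15 12 21 16 27 20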
Here is an answer allowing any set of arbitrary functions to be applied to defined groups within a vector.
# source data
test <- 1:9
# categorisations of source data
cattest <- rep(1:3,each=3)
#[1] 1 1 1 2 2 2 3 3 3
Make the function to differentially apply functions:
categ <- function(x, catg) {
  mapply(
    function(a, b) {
      switch(b,
             a * 2,
             a * 3,
             a / 2)
    },
    x,
    catg
  )
}
# where cattest = 1, multiply by 2
# where cattest = 2, multiply by 3
# where cattest = 3, divide by 2
The result:
categ(test,cattest)
#[1] 2.0 4.0 6.0 12.0 15.0 18.0 3.5 4.0 4.5

Efficient subsetting in R using 2 dataframes

I have a big time series full in one dataframe and a list of timestamps in a different dataframe test. I need to subset full to the data points surrounding the timestamps in test. My first instinct (as an R noob) was to write the following, which was wrong:
subs <- subset(full,(full$dt>test$dt-i) & (full$dt<test$dt+i))
Looking at the result, I realized that R was recycling test$dt along full$dt rather than checking every combination, giving the wrong result. My other option is to write a loop like the one below:
subs <- data.frame()
for (j in test$dt)
  subs <- rbind(subs, subset(full, full$dt > (j-i) & full$dt < (j+i)))
I feel that there might be a better way than loops, and this article implores us to avoid R loops as much as possible. The other reason is that I might hit performance issues, as this will be at the heart of an optimization algorithm. Any suggestions from gurus would be greatly appreciated.
EDIT:
Here is some reproducible code that shows the wrong approach as well as the approach that works but could be better.
#create a times series
full <- data.frame(1:200, rnorm(200, 0, 1))
colnames(full)<-c("dt","val")
#my smaller array of points of interest
test <- data.frame(seq(5,200,by=23))
colnames(test)<-c("dt")
# my range around the points of interset
i<-3
#the wrong approach
subs <- subset(full,(full$dt>test$dt-i) & (full$dt<test$dt+i))
#this works, but not sure this is the best way to go about it
subs <- data.frame()
for (j in test$dt)
  subs <- rbind(subs, subset(full, full$dt > (j-i) & full$dt < (j+i)))
EDIT:
I updated the values to better reflect my use case, and I see @mrdwab's solution pulling ahead unexpectedly, by a wide margin.
I am using the benchmark code from @mrdwab, and the initialization is as follows:
set.seed(1)
full <- data.frame(
dt = 1:15000000,
val = floor(rnorm(15000000,0,1))
)
test <- data.frame(dt = floor(runif(24,1,15000000)))
i <- 500
The benchmarks are:
test replications elapsed relative
2 mrdwab 2 1.31 1.00000
3 spacedman 2 69.06 52.71756
1 andrie 2 93.68 71.51145
4 original 2 114.24 87.20611
Totally unexpected. Mind = blown. Can someone please shed some light on this dark corner and explain what is happening?
Important: As @mrdwab notes below, his solution works only if the vectors are integers. If not, @spacedman has the right solution.
Here's a real R way to do it. Functionally. No loops...
Starting with Andrie's example data.
First, an interval comparison function:
> cf = function(l,u){force(l);force(u);function(x){x>l & x<u}}
An OR composition function:
> OR = function(f1,f2){force(f1);force(f2);function(x){f1(x)|f2(x)}}
Now there's sort of a loop here, to construct a list of those comparison functions:
> funs = mapply(cf,test$dt-i,test$dt+i)
Now combine all those into one function:
> anyF = Reduce(OR,funs)
And now we apply the OR composition to our interval testing functions:
> head(full[anyF(full$dt),])
dt val
3 3 -0.83562861
4 4 1.59528080
5 5 0.32950777
6 6 -0.82046838
7 7 0.48742905
26 26 -0.05612874
What you've got now is a function of a single variable that tests if the value is in the ranges you defined.
> anyF(1:10)
[1] FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
I don't know if this is faster, or better, or what. Someone do some benchmarks!
I don't know if it's any more efficient, but I would think you could also do something like this to get what you want:
subs <- apply(test, 1, function(x) c((x-2):(x+2)))
full[which(full$dt %in% subs), ]
I had to adjust your "3" to "2" since x would be included both ways.
Benchmarking (just for fun)
#Spacedman leads the way!
First, the required data and functions.
## Data
set.seed(1)
full <- data.frame(
dt = 1:200,
val = rnorm(200,0,1)
)
test <- data.frame(dt = seq(5,200,by=23))
i <- 3
## Spacedman's functions
cf = function(l,u){force(l);force(u);function(x){x>l & x<u}}
OR = function(f1,f2){force(f1);force(f2);function(x){f1(x)|f2(x)}}
funs = mapply(cf,test$dt-i,test$dt+i)
anyF = Reduce(OR,funs)
Second, the benchmarking.
## Benchmarking
require(rbenchmark)
benchmark(andrie = do.call(rbind,
lapply(test$dt,
function(j) full[full$dt > (j-i) &
full$dt < (j+i), ])),
mrdwab = {subs <- apply(test, 1,
function(x) c((x-(i-1)):(x+(i-1))))
full[which(full$dt %in% subs), ]},
spacedman = full[anyF(full$dt),],
original = {subs <- data.frame()
for (j in test$dt)
subs <- rbind(subs,
subset(full, full$dt > (j-i) &
full$dt < (j+i)))},
columns = c("test", "replications", "elapsed", "relative"),
order = "relative")
# test replications elapsed relative
# 3 spacedman 100 0.064 1.000000
# 2 mrdwab 100 0.105 1.640625
# 1 andrie 100 0.520 8.125000
# 4 original 100 1.080 16.875000
There is nothing inherently wrong with your code. To achieve your aim, you need a loop of some sort around a vectorised subset operation.
But here is a more R-ish way to do it, which might well be faster:
do.call(rbind,
lapply(test$dt, function(j)full[full$dt > (j-i) & full$dt < (j+i), ])
)
PS: You can significantly simplify your reproducible example:
set.seed(1)
full <- data.frame(
dt = 1:200,
val = rnorm(200,0,1)
)
test <- data.frame(dt = seq(5,200,by=23))
i <- 3
xx <- do.call(rbind,
lapply(test$dt, function(j)full[full$dt > (j-i) & full$dt < (j+i), ])
)
head(xx)
dt val
3 3 -0.83562861
4 4 1.59528080
5 5 0.32950777
6 6 -0.82046838
7 7 0.48742905
26 26 -0.05612874
One more way, using data.table:
{
  temp <- data.table(x = unique(c(full$dt, (test$dt - i), (test$dt + i))), key = "x")
  temp[, index := 1:nrow(temp)]
  startpoints <- temp[J(test$dt - i), index]$index
  endpoints <- temp[J(test$dt + i), index]$index
  allpoints <- as.vector(mapply(FUN = function(x, y) x:y, x = startpoints, y = endpoints))
  setkey(x = temp, index)
  ans <- temp[J(allpoints)]$x
}
benchmarks:
number of rows in test: 9
number of rows in full: 10000
test replications elapsed relative
1 spacedman 100 0.406 1.000
2 new 100 1.179 2.904
number of rows in full: 100000
test replications elapsed relative
2 new 100 2.374 1.000
1 spacedman 100 3.753 1.581

Count the number of valid observations (no NA) pairwise in a data frame

Say I have a data frame like this:
Df <- data.frame(
  V1 = c(1, 2, 3, NA, 5),
  V2 = c(1, 2, NA, 4, 5),
  V3 = c(NA, 2, NA, 4, NA)
)
Now I want to count the number of valid observations for every combination of two variables. For that, I wrote a function sharedcount:
sharedcount <- function(x, ...){
  nx <- names(x)
  alln <- combn(nx, 2)
  out <- apply(alln, 2,
               function(y) sum(complete.cases(x[y])))
  data.frame(t(alln), out)
}
This gives the output:
> sharedcount(Df)
X1 X2 out
1 V1 V2 3
2 V1 V3 1
3 V2 V3 2
All fine, but the function itself takes pretty long on big data frames (600 variables and about 10000 observations). I have the feeling I'm overlooking an easier approach, especially since cor(..., use='pairwise') still runs a whole lot faster even though it has to do something similar:
> require(rbenchmark)
> benchmark(sharedcount(TestDf),cor(TestDf,use='pairwise'),
+ columns=c('test','elapsed','relative'),
+ replications=1
+ )
test elapsed relative
2 cor(TestDf, use = "pairwise") 0.25 1.0
1 sharedcount(TestDf) 1.90 7.6
Any tips are appreciated.
Note : Using Vincent's trick, I wrote a function that returns the same data frame. Code in my answer below.
The following is slightly faster:
x <- !is.na(Df)
t(x) %*% x
# test elapsed relative
# cor(Df) 12.345 1.000000
# t(x) %*% x 20.736 1.679708
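On the question's Df, the cross-product gives the full matrix of pairwise counts at once (the diagonal holds the per-variable counts):
x <- !is.na(Df)
t(x) %*% x
#    V1 V2 V3
# V1  4  3  1
# V2  3  4  2
# V3  1  2  2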
I thought Vincent's solution looked really elegant, not to mention faster than my sophomoric for loop, except it seems to need an extraction step, which I added below. This is just an example of the heavy overhead of the apply method when used with data frames.
shrcnt <- function(Df) {
  Comb <- t(combn(1:ncol(Df), 2))
  shrd <- 1:nrow(Comb)
  for (i in seq_along(shrd)) {
    shrd[i] <- sum(complete.cases(Df[, Comb[i, 1]], Df[, Comb[i, 2]]))
  }
  return(shrd)
}
benchmark(
  shrcnt(Df), sharedcount(Df),
  {prs <- t(x) %*% x; prs[lower.tri(prs)]},
  cor(Df, use = 'pairwise'),
  columns = c('test', 'elapsed', 'relative'),
  replications = 100
)
#--------------
test elapsed relative
3 { 0.008 1.0
4 cor(Df, use = "pairwise") 0.020 2.5
2 sharedcount(Df) 0.092 11.5
1 shrcnt(Df) 0.036 4.5
Based on the lovely trick of Vincent and the additional lower.tri() suggestion of DWin, I came up with the following function, which gives me the same output (i.e. a data frame) as my original one and runs a whole lot faster:
sharedcount2 <- function(x, stringsAsFactors = FALSE, ...){
  counts <- crossprod(!is.na(x))
  id <- lower.tri(counts)
  count <- counts[id]
  X1 <- colnames(counts)[col(counts)[id]]
  X2 <- rownames(counts)[row(counts)[id]]
  data.frame(X1, X2, count, stringsAsFactors = stringsAsFactors)
}
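On the small Df from the question it reproduces the original output:
sharedcount2(Df)
#   X1 X2 count
# 1 V1 V2     3
# 2 V1 V3     1
# 3 V2 V3     2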
Note the use of crossprod(), which gives a small improvement compared to %*% but does exactly the same thing.
The timings :
> benchmark(sharedcount(TestDf),sharedcount2(TestDf),
+ replications=5,
+ columns=c('test','replications','elapsed','relative'))
test replications elapsed relative
1 sharedcount(TestDf) 5 10.00 90.90909
2 sharedcount2(TestDf) 5 0.11 1.00000
Note: I supplied TestDf in the question, as I noticed that the timings differ depending on the size of the data frame. As shown here, the time increase is much more dramatic than a comparison on a small data frame would suggest.
