vector binding in R

I would like to implement a simulation program with the following structure:
It has a for loop, and the program generates a vector in each iteration. I need each generated vector to be appended to the existing vector.
I do not know how to do this in R. Thanks for the help.

These answers work, but they all require a call to a non-deterministic function like sample() inside the loop. This is not loop-invariant code (it is random each time), but it can still be moved out of the for loop: the trick is to use the n argument and generate all the random numbers you need beforehand (if your problem allows this; some may not, but many do). Now you make one call rather than n calls, which matters if your n is large. Here is a quick example: a random walk (but many problems can be phrased this way). Also, full disclosure: I haven't had any coffee today, so please point out if you see an error :-)
steps <- 30
n <- 100
directions <- c(-1, 1)
results <- vector('list', n)
for (i in seq_len(n)) {
  walk <- numeric(steps)
  for (s in seq_len(steps)) {
    walk[s] <- sample(directions, 1)
  }
  results[[i]] <- sum(walk)
}
We can rewrite this with one call to sample():
all.steps <- sample(directions, n*steps, replace=TRUE)
dim(all.steps) <- c(n, steps)
walks <- apply(all.steps, 1, sum)
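As an aside, base R's rowSums() is usually faster than apply(all.steps, 1, sum) here, since the summation happens in compiled code; a drop-in alternative:
walks <- rowSums(all.steps)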
Proof of speed increase (n=10000):
> system.time({
+   for (i in seq_len(n)) {
+     walk <- numeric(steps)
+     for (s in seq_len(steps)) {
+       walk[s] <- sample(directions, 1)
+     }
+     results[[i]] <- sum(walk)
+   }})
   user  system elapsed
  4.231   0.332   4.758
> system.time({
+   all.steps <- sample(directions, n*steps, replace=TRUE)
+   dim(all.steps) <- c(n, steps)
+   walks <- apply(all.steps, 1, sum)
+ })
   user  system elapsed
  0.010   0.001   0.012
If your simulation needs just one random variable per simulation function call, use sapply(), or better yet the multicore package's mclapply(). Revolution Analytics's foreach package may be of use here too. Also, JD Long has a great presentation and post about simulating stuff in R on Hadoop via Amazon's EMR here (I can't find the video, but I'm sure someone will know).
Take home points:
Preallocate with numeric(n) or vector('list', n)
Push loop-invariant code out of for loops. Cleverly pull stochastic functions out of loops using their n argument.
Try hard for sapply() or lapply(), or better yet mclapply().
Don't use x <- c(x, rnorm(100)). Every time you do this, a member of R-core kills a puppy.

Probably the best thing you can do is preallocate a list of length n (n is the number of iterations) and flatten the list after you're done.
n <- 10
start <- vector("list", n)
for (i in 1:n) {
  start[[i]] <- sample(10)
}
start <- unlist(start)
You could do it the old nasty way. This may be slow for larger vectors.
start <- c()
for (i in 1:n) {
  add <- sample(10)
  start <- c(start, add)
}
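To see why growing is discouraged, here is a rough timing sketch you can run yourself (no results shown; numbers vary by machine, but the preallocated version should pull ahead as n grows):
n <- 1e4
system.time({            # growing: start is reallocated on every iteration
  grown <- c()
  for (i in 1:n) grown <- c(grown, sample(10))
})
system.time({            # preallocating: one allocation, then fill and flatten
  pre <- vector("list", n)
  for (i in 1:n) pre[[i]] <- sample(10)
  pre <- unlist(pre)
})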

x <- rnorm(100)
for (i in 1:100) {
  x <- c(x, rnorm(100))
}
This link should be useful: http://www.milbo.users.sonic.net/ra/

Assuming your simulation function -- call it func -- returns a vector of the same length each time, you can store the results in the columns of a preallocated matrix:
sim1 <- function(reps, func) {
  first <- func()
  result <- matrix(first, nrow=length(first), ncol=reps)
  for (i in seq_len(reps)[-1]) {  # column 1 already holds the first draw
    result[, i] <- func()
  }
  return(as.vector(result))
}
Or you could express it as follows using replicate:
sim2 <- function(reps, func) {
  return(as.vector(replicate(reps, func(), simplify=TRUE)))
}
> sim2(3, function() 1:3)
[1] 1 2 3 1 2 3 1 2 3
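A quick sanity check that the two versions agree, using a deterministic func so the outputs are comparable:
identical(sim1(3, function() 1:3), sim2(3, function() 1:3))  # should be TRUE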

Related

Good way to speed up my R code from a for loop

I have a long vector x, say of length 1e6, and a weight vector w of the same length. I want to find a small number (i.e., a scalar value) to add to each element of x so that the value of my expression, shown in the code below, is as small as possible.
I tried using a grid from -1 to 1 with by = 0.001 and a for loop to get the minimal result of my expression, but this is not a good way to do it, since I will repeat the same operation 100 times or more (and sometimes x reaches length 1e7 or more), which takes a long time to finish.
getSigmoid <- function(x) {return(1 / (1 + exp(-x)))}
x <- rnorm(1e6)
w <- rnorm(1e6)
pool <- seq(-1, 1, by = 0.001)
npool <- length(pool)
result <- rep(NA, times = npool)
stime <- Sys.time()
for (i in 1:npool) {
  cat("i: ", i, "/", npool, "\n")
  flush.console()
  result[i] <- abs(sum(getSigmoid(x + pool[i]) * w) / sum(w) - 0.5)
}
etime <- Sys.time()
(spenttime <- etime - stime)
idx_min <- which.min(result)
cat("minimal value is: ", result[idx_min], "\n")
cat("solution is: ", pool[idx_min], "\n")
I hope to get a better solution (i.e., improve the computation speed) for my problem. I tried to think of a vectorization approach but could not figure one out. I understand parallelization is one option, but the code already runs inside a parallel function (i.e., nested parallelism would be more difficult). So if someone can come up with a method based on vectorization or something else, that would be very helpful.
Instead of calculating the entire vector space and finding the minimum, you will need to use a better search method or an optimization routine.
Base R has the function optimize which can do this.
set.seed(1234)
x <- rnorm(1e6)
w <- rnorm(1e6)
stime <- Sys.time()
sumw <- sum(w) # perform the calculation once and store
# create functions:
getSigmoid <- function(x) {return(1 / (1 + exp(-x)))}
f <- function(pool) {
  abs(sum(getSigmoid(x + pool) * w) / sumw - 0.5)
}
#optimize function performs the search
print(optimize(f, c(-1, 1), tol = 0.00001))
etime <- Sys.time()
print(spenttime <- etime - stime)
Using the built-in function improves the resolution of the result and greatly improves the performance. Your algorithm took about 30 seconds on my machine; the optimize function took about 0.3 secs, roughly a 100x improvement.
Another alternative is the non-linear minimization function nlm. Use the same code as above but substitute nlm(f, 0) for the optimize call.
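For concreteness, a minimal sketch of that substitution (same f as above; nlm returns the minimizer in $estimate and the objective value in $minimum):
print(nlm(f, 0))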

Nested rolling sum in vector

I am struggling to produce efficient code to compute the result vector r from an input vector v using this function:
r(i) = \sum_{j=i-N}^{i} [v(i)-v(j)] \exp(v(i)-v(j))
where i runs (from N to M) over the vector v, and the size of v is M, with M >> N.
Of course this is feasible with 2 nested for loops, but it is too slow for computational purposes, and probably an out-of-fashion, deprecated style...
A MWE:
r <- rep(NA, length(v))
for (i in (N+1):length(v)) {
  csum <- 0
  for (j in i:(i-N)) {
    csum <- csum + (v[i]-v[j])*exp(v[i]-v[j])
  }
  r[i] <- csum
}
In my real application M > 10^5, and there are actually several such v vectors to process.
I have been trying with nested applications of lapply and rollapply without success.
Any suggestion is welcome.
Thanks!
I don't know if it is any more efficient, but here is something you can try:
r[N:M] <- sapply(N:M, function(i) tail(cumsum((v[i]-v[1:N])*exp(v[i]-v[1:N])), 1))
Checking that both computations give the same results: I got r your way and r2 my way (initializing r2 to rep(NA, M)) and assessed the similarity:
all(abs(r-r2) < 1e-12, na.rm=TRUE)
# [1] TRUE
NOTE: as in @lmo's answer, tail(cumsum(...), 1) can be replaced more efficiently by just using sum(...):
r[N:M] <- sapply(N:M, function(i) sum((v[i]-v[1:N])*exp(v[i]-v[1:N])))
Here is a method with a single for loop.
# create new blank vector
rr <- rep(NA, M)
for (i in N:length(v)) {
  rr[i] <- sum((v[i] - v[seq_len(N)]) * exp(v[i] - v[seq_len(N)]))
}
Check for equality:
all.equal(r, rr)
[1] TRUE
You can avoid computing the difference twice by storing it, which should add a little speedup.
for (i in N:length(v)) {
  x <- v[i] - v[seq_len(N)]
  rr[i] <- sum(x * exp(x))
}
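If memory permits, the remaining loop can also be eliminated with outer(), at the cost of materializing an (M-N+1) x N matrix of differences; a sketch under that assumption:
d <- outer(v[N:M], v[seq_len(N)], "-")  # d[a, j] = v[N + a - 1] - v[j]
rr2 <- rep(NA, M)
rr2[N:M] <- rowSums(d * exp(d))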

Loop inside a loop in R

I am trying to write R code that puts another loop inside the one I've already created. Here is my code:
t <- rep(1, 1000)
omega <- seq(from=1, to=12, by=1)
for (i in 1:1000) {
  omega <- setdiff(omega, sample(1:12, 1))
  t[i] <- length(omega)
  remove <- 0
  f <- length(t[! t %in% remove]) + 1
}
When I run this code, I get the number of trials f it takes to reach the zero vector, but I want to run 10000 iterations of this experiment.
replicate is probably how you want to run the outer loop. There's also no need for the f assignment to be inside the loop; here I've moved it outside and converted it to simply the count of the elements of t that are greater than 0, plus 1.
result <- replicate(10000, {
  t <- rep(1, 1000)
  omega <- 1:12
  for (i in seq_along(t)) {
    omega <- setdiff(omega, sample(1:12, 1))
    t[i] <- length(omega)
  }
  sum(t > 0) + 1
})
I suspect your code could be simplified in other ways as well, and also that you could just write down the distribution you're looking for without simulation. I believe your variable of interest is just how long it takes until you get at least one of each of the numbers 1:12, yes?
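That is the classic coupon collector problem. If you do want to simulate it, here is a more direct sketch that skips the length-1000 bookkeeping entirely (draws_until_complete is a hypothetical helper, not from the question):
draws_until_complete <- function(k = 12) {
  seen <- logical(k)   # which of the k values have appeared so far
  n <- 0
  while (!all(seen)) {
    seen[sample.int(k, 1)] <- TRUE
    n <- n + 1
  }
  n
}
result <- replicate(10000, draws_until_complete())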
Are you just looking to run your existing loop 10,000 times, like below?
f <- rep(NA, 10000)
remove <- 0
for (j in 1:10000) {
  t <- rep(1, 1000)                 # reset t and omega for each experiment
  omega <- seq(from=1, to=12, by=1)
  for (i in 1:1000) {
    omega <- setdiff(omega, sample(1:12, 1))
    t[i] <- length(omega)
  }
  f[j] <- length(t[! t %in% remove]) + 1
}

How to vectorize triple nested loops?

I've searched similar problems and I have a vague idea of what I should do: vectorize everything or use the apply() family. But I'm a beginner at R programming and both of the above methods are quite confusing.
Here is my source code:
x <- rlnorm(100, 0, 1.6)
j = 0
k = 0
i = 0
h = 0
lambda <- rep(0, 200)
sum1 <- rep(0, 200)
constjk = 0
wj = 0
wk = 0
for (h in 1:200) {
  lambda[h] = 2 + h/12.5
  N = ceiling(lambda[h]*max(x))
  for (j in 0:N) {
    wj = (sum(x <= (j+1)/lambda[h]) - sum(x <= j/lambda[h]))/100
    for (k in 0:N) {
      constjk = dbinom(k, j + k, 0.5)
      wk = (sum(x <= (k+1)/lambda[h]) - sum(x <= k/lambda[h]))/100
      sum1[h] = sum1[h] + (lambda[h]/2)*constjk*wk*wj
    }
  }
}
Let me explain a bit. I want to collect 200 sum1 values (that's the first loop), and each sum1 value is a summation of (lambda[h]/2)*constjk*wk*wj terms, hence the other two loops. The most tedious part is that N changes with h, so I have no idea how to vectorize the j-loop and the k-loop. Of course I can vectorize the h-loop with lambda <- seq() and N <- ceiling(), but that's the best I can do. Is there a way to further simplify the code?
Your code can be perfectly vectorized with 3 nested sapply calls. It might be a bit hard to read for the untrained eye, but the essence of it is that instead of adding one value at a time to sum1[h], we calculate all the terms produced by the innermost loop in one go and sum them up.
Although this vectorized solution is faster than your triple for loop, the improvement is not dramatic. If you plan to use it many times, I suggest you implement it in C or Fortran (with regular for loops), which improves the speed a lot. Beware, though, that it has high time complexity and will scale badly with increasing values of lambda, ultimately reaching a point where it cannot be computed within reasonable time regardless of the implementation.
lambda <- 2 + 1:200/12.5
sum1 <- sapply(lambda, function(l){
  N <- ceiling(l*max(x))
  sum(sapply(0:N, function(j){
    wj <- (sum(x <= (j+1)/l) - sum(x <= j/l))/100
    sum(sapply(0:N, function(k){
      constjk <- dbinom(k, j + k, 0.5)
      wk <- (sum(x <= (k+1)/l) - sum(x <= k/l))/100
      l/2*constjk*wk*wj
    }))
  }))
})
Btw, you don't need to predefine variables like h, j, k, wj and wk, especially not when vectorizing: assignments to them inside the functions fed to sapply create local variables of the same name that shadow the ones you predefined.
Let's wrap your simulation in a function and time it:
sim1 <- function(num=20){
  set.seed(42)
  x <- rlnorm(100, 0, 1.6)
  j = 0
  k = 0
  i = 0
  h = 0
  lambda <- rep(0, num)
  sum1 <- rep(0, num)
  constjk = 0
  wj = 0
  wk = 0
  for (h in 1:num) {
    lambda[h] = 2 + h/12.5
    N = ceiling(lambda[h]*max(x))
    for (j in 0:N) {
      wj = (sum(x <= (j+1)/lambda[h]) - sum(x <= j/lambda[h]))/100
      for (k in 0:N) {
        set.seed(42)
        constjk = dbinom(k, j + k, 0.5)
        wk = (sum(x <= (k+1)/lambda[h]) - sum(x <= k/lambda[h]))/100
        sum1[h] = sum1[h] + (lambda[h]/2)*constjk*wk*wj
      }
    }
  }
  sum1
}
system.time(res1 <- sim1())
# user system elapsed
# 5.4 0.0 5.4
Now let's make it faster:
sim2 <- function(num=20){
  set.seed(42) # to make it reproducible
  x <- rlnorm(100, 0, 1.6)
  h <- 1:num
  sum1 <- numeric(num)
  lambda <- 2 + 1:num/12.5
  N <- ceiling(lambda*max(x))
  # functions for wj and wk
  wjfun <- function(x, j, lambda, h){
    (sum(x <= (j+1)/lambda[h]) - sum(x <= j/lambda[h]))/100
  }
  wkfun <- function(x, k, lambda, h){
    (sum(x <= (k+1)/lambda[h]) - sum(x <= k/lambda[h]))/100
  }
  # function to calculate values of sum1
  fun1 <- function(N, h, x, lambda) {
    sum1 <- 0
    set.seed(42) # to make it reproducible
    # calculate constants using outer
    const <- outer(0:N[h], 0:N[h], FUN=function(j, k) dbinom(k, j + k, 0.5))
    wk <- numeric(N[h]+1)
    # loop only once to calculate wk
    for (k in 0:N[h]){
      wk[k+1] <- (sum(x <= (k+1)/lambda[h]) - sum(x <= k/lambda[h]))/100
    }
    for (j in 0:N[h]) {
      wj <- (sum(x <= (j+1)/lambda[h]) - sum(x <= j/lambda[h]))/100
      for (k in 0:N[h]) {
        sum1 <- sum1 + (lambda[h]/2)*const[j+1, k+1]*wk[k+1]*wj
      }
    }
    sum1
  }
  for (h in 1:num) {
    sum1[h] <- fun1(N, h, x, lambda)
  }
  sum1
}
system.time(res2 <- sim2())
#user system elapsed
#1.25 0.00 1.25
all.equal(res1,res2)
#[1] TRUE
Timings for @Backlin's code (with 20 iterations) for comparison:
user system elapsed
3.30 0.00 3.29
If this is still too slow and you cannot or don't want to use another language, there is also the possibility of parallelization. As far as I can see, the outer loop is embarrassingly parallel, and there are some nice and easy packages for parallelization.
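A minimal sketch of that idea with the base parallel package, assuming num, fun1, N, x and lambda are in scope as defined in sim2 above (mc.cores is machine-dependent, and mclapply only runs with mc.cores = 1 on Windows, where parLapply with a cluster is the alternative):
library(parallel)
# run one value of h per worker and collect the results
sum1 <- unlist(mclapply(1:num, function(h) fun1(N, h, x, lambda),
                        mc.cores = max(1, detectCores() - 1)))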

How to make a loop run faster in R?

I want to use arms() to get one sample each time, in a loop like the following one in my function. It runs very slowly. How can I make it run faster? Thanks.
library(HI)
dmat <- matrix(0, nrow=100,ncol=30)
system.time(
for (d in 1:100){
for (j in 1:30){
y <- rep(0, 101)
for (i in 2:100){
y[i] <- arms(0.3, function(x) (3.5+0.000001*d*j*y[i-1])*log(x)-x,
function(x) (x>1e-4)*(x<20), 1)
}
dmat[d, j] <- sum(y)
}
}
)
This is a version based on Tommy's answer but avoiding all loops:
library(multicore) # or library(parallel) in 2.14.x
set.seed(42)
m = 100
n = 30
system.time({
  arms.C <- getNativeSymbolInfo("arms")$address
  bounds <- 0.3 + convex.bounds(0.3, dir = 1, function(x) (x>1e-4)*(x<20))
  if (diff(bounds) < 1e-07) stop("pointless!")
  # create the vector of z values
  zval <- 0.00001 * rep(seq.int(n), m) * rep(seq.int(m), each = n)
  # apply the inner function to each grid point and return the matrix
  dmat <- matrix(unlist(mclapply(zval, function(z)
    sum(unlist(lapply(seq.int(100), function(i)
      .Call(arms.C, bounds, function(x) (3.5 + z * i) * log(x) - x,
            0.3, 1L, parent.frame())
    )))
  )), m, byrow=TRUE)
})
On a multicore machine this will be really fast since it spreads the load across cores. On a single-core machine (or for poor Windows users) you can replace mclapply above with lapply and get only a slight speedup compared to Tommy's answer. But note that the result will be different for parallel versions since it will use different RNG sequences.
Note that any C code that needs to evaluate R functions will be inherently slow (because interpreted code is slow). I have added the arms.C just to remove all R->C overhead to make moli happy ;), but it doesn't make any difference.
You could squeeze out a few more milliseconds by using column-major processing (the question code was row-major, which requires re-copying, as R matrices are always column-major).
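A tiny self-contained sketch of the column-major point (shapes arbitrary here): R stores matrices column by column, so looping over columns in the outer loop and rows in the inner loop walks memory contiguously, whereas matrix(..., byrow=TRUE) has to rearrange its input:
m2 <- matrix(0, nrow = 4, ncol = 3)
for (j in seq_len(ncol(m2)))    # outer loop over columns
  for (i in seq_len(nrow(m2)))  # inner loop walks contiguous memory
    m2[i, j] <- i + 10 * j      # placeholder computation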
Edit: I noticed that moli changed the question slightly since Tommy answered - so instead of the sum(...) part you have to use a loop, since the y[i] are dependent; the function(z) would then look like
function(z) {
  y <- 0
  for (i in seq.int(99))
    y <- y + .Call(arms.C, bounds, function(x) (3.5 + z * y) * log(x) - x,
                   0.3, 1L, parent.frame())
  y
}
Well, one effective way is to get rid of the overhead inside arms. It does some checks and calls the indFunc every time even though the result is always the same in your case.
Some other evaluations can also be done outside the loop. These optimizations bring the time down from 54 secs to around 6.3 secs on my machine. ...and the answer is identical.
set.seed(42)
#dmat2 <- ##RUN ORIGINAL CODE HERE##
# Now try this:
set.seed(42)
dmat <- matrix(0, nrow=100,ncol=30)
system.time({
  e <- new.env()
  bounds <- 0.3 + convex.bounds(0.3, dir = 1, function(x) (x>1e-4)*(x<20))
  f <- function(x) (3.5+z*i)*log(x)-x
  if (diff(bounds) < 1e-07) stop("pointless!")
  for (d in seq_len(nrow(dmat))) {
    for (j in seq_len(ncol(dmat))) {
      y <- 0
      z <- 0.00001*d*j
      for (i in 1:100) {
        y <- y + .Call("arms", bounds, f, 0.3, 1L, e)
      }
      dmat[d, j] <- y
    }
  }
})
all.equal(dmat, dmat2) # TRUE
Why not like this?
dat <- expand.grid(d=1:10, j=1:3, i=1:10)
arms.func <- function(vec) {
  require(HI)
  dji <- vec[1]*vec[2]*vec[3]
  arms.out <- arms(0.3,
                   function(x, params) (3.5 + 0.00001*params)*log(x) - x,
                   function(x, params) (x>1e-4)*(x<20),
                   n.sample=1,
                   params=dji)
  return(arms.out)
}
dat$arms <- apply(dat, 1, arms.func)
library(plyr)
out <- ddply(dat, .(d, j), summarise, arms=sum(arms))
matrix(out$arms, nrow=length(unique(out$d)), ncol=length(unique(out$j)))
However, it's still single-core and time-consuming. But that isn't R being slow, it's the arms function.
