Compute a double sum in R

I have to compute a test statistic with a double sum.
I solved it like this:
T <- numeric(1)
for (j in 1:n) {
  for (k in 1:n) {
    T = T + (1/n) * exp(-(1/2) * (Y[j] - Y[k])^2)
  }
  T = T - sqrt(2) * exp(-(1/4) * Y[j]^2)
}
T = T + n * 3^(-(1/2))
Is there an easier way to compute the test statistic?

Use outer() to compute all pairwise squared differences at once:
n=100;
Y=runif(100);
T=0;
Ydiff=outer(Y,Y,"-")^2;
Y_1=exp(-0.5*Ydiff);
Y_2=sqrt(2)*exp(-0.25*Y^2);
T=sum(rowMeans(Y_1)-Y_2) + (n*(3^(-(1/2))))
Comparing the methods given so far on the same data gives:
T=0;
n=100;
set.seed(100)
Y=runif(100);
for(j in 1:n){
for(k in 1:n){
T = T + ((1/n)*(exp(-(1/2)*((Y[j]-Y[k])^2))));
}
T = T - ((sqrt(2))*(exp(-(1/4)*((Y[j])^2))));
}
T = T + (n*(3^(-(1/2))));
print(T)
#21.18983
T=0;
Ydiff=outer(Y,Y,"-")^2;
Y_1=exp(-0.5*Ydiff);
Y_2=sqrt(2)*exp(-0.25*Y^2);
T=sum(rowMeans(Y_1)-Y_2) + (n*(3^(-(1/2))));
print(T)
# 21.18983
# expand.grid approach; the squared terms are missing here, hence the different result
T=0;
indexes = expand.grid(1:n,1:n);
T = 1/n*sum(exp(-1/2)*((Y[indexes[,1]]-Y[indexes[,2]])));
T = T-(sqrt(2))*sum(exp(-1/4*(Y[1:n])));
T = T+n/sqrt(3);
print(T)
# -66.71403
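As a sanity check, the outer() version can be wrapped in a function and compared against the nested loops on the same data. This is only a minimal sketch; the function names double_sum_outer and double_sum_loop are made up here for illustration and do not come from the answers above.

```r
# Hypothetical helper: vectorized version built on outer()
double_sum_outer <- function(Y) {
  n <- length(Y)
  sq_diff <- outer(Y, Y, "-")^2          # n-by-n matrix of (Y[j] - Y[k])^2
  sum(exp(-0.5 * sq_diff)) / n -
    sqrt(2) * sum(exp(-0.25 * Y^2)) +
    n / sqrt(3)
}

# Hypothetical helper: the nested loops from the question, as a function
double_sum_loop <- function(Y) {
  n <- length(Y)
  total <- 0
  for (j in seq_len(n)) {
    for (k in seq_len(n)) {
      total <- total + (1 / n) * exp(-0.5 * (Y[j] - Y[k])^2)
    }
    total <- total - sqrt(2) * exp(-0.25 * Y[j]^2)
  }
  total + n / sqrt(3)
}

set.seed(100)
Y <- runif(100)
all.equal(double_sum_outer(Y), double_sum_loop(Y))
```

Using a name other than T for the accumulator also avoids masking R's built-in shorthand for TRUE.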

It's more useful to create the indexes in advance and then sum over a vector, rather than stepping through the indices in two nested loops:
indexes = expand.grid(1:n,1:n)
T = 1/n*sum(exp(-1/2*(Y[indexes[,1]]-Y[indexes[,2]])^2))
T = T-(sqrt(2))*sum(exp(-1/4*Y[1:n]^2))
T = T+n/sqrt(3)
Edit: For large n this is impractical, as an n of 1,000,000 would make a 3.7 TB data frame with expand.grid. You can always use the for loops, even if they are slow, but I would recommend C++ if you need an absurdly large n, because that is 1 trillion iterations, which will take a very long time to compute.
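If memory rather than speed is the constraint, one middle ground is to keep a single vectorized pass per element of Y, so that only a length-n vector exists at any time. The following is a rough sketch under that assumption; double_sum_rowwise is a hypothetical name, and the amount of arithmetic is still proportional to n^2.

```r
# Hypothetical helper: one vectorized pass per element, O(n) memory
double_sum_rowwise <- function(Y) {
  n <- length(Y)
  # For each y, sum exp(-(y - Y[k])^2 / 2) over all k
  row_sums <- vapply(Y, function(y) sum(exp(-0.5 * (y - Y)^2)), numeric(1))
  sum(row_sums) / n - sqrt(2) * sum(exp(-0.25 * Y^2)) + n / sqrt(3)
}

set.seed(100)
Y <- runif(100)

# Reference value from the full n-by-n matrix, feasible here because n is small
reference <- sum(exp(-0.5 * outer(Y, Y, "-")^2)) / length(Y) -
  sqrt(2) * sum(exp(-0.25 * Y^2)) + length(Y) / sqrt(3)

all.equal(double_sum_rowwise(Y), reference)
```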

Related

Sum elements in a list containing functions

I'm computing ft for values of k and storing the results in the list funcList. Now I want to define Xt as a function of t, given by the sum of all elements in funcList. This is what I've done:
n = 100
funcList = list()
ft = function(t) {(abs(t) <= 1)*(1-(t)^2)+(abs(t) > 1)*(0)}
for (k in 1:100){
funcList[[k]] = ft(t+k/n)*rnorm(100,0,1)
}
Xt = rowSums(sapply(funcList, ?)
However, I'm not sure at all how I should express Xt here and how to use the functions rowSums() and sapply().
I'm really not sure what you are trying to do.
Your code throws an error, since no value of t is defined inside your for loop.
Are you trying to do something like this?
n = 100
ft = function(t) {(abs(t) <= 1)*(1-(t)^2)+(abs(t) > 1)*(0)}
# Build one function per k; force(k) fixes the value of k inside each function,
# whereas functions created in a plain for loop would all see the final k
funcList = lapply(1:n, function(k) {
  force(k)
  function(t) {ft(t+k/n)*rnorm(100,0,1)}
})
For a fixed t (say t = -1) you can then do:
result <- sapply(funcList,function(ft_i)ft_i(-1))
and afterwards call rowSums:
result2 <- rowSums(result)
But to be honest, that sounds like a really complicated thing to do.
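If the goal is simply Xt as a function of t, one possible reading is to draw the noise once and let Xt compute the weighted sum directly, without a list of functions. This is a sketch under that assumption: the matrix Z and the idea of fixing the random draws in advance are not in the question.

```r
n <- 100
ft <- function(t) {(abs(t) <= 1) * (1 - t^2) + (abs(t) > 1) * 0}

set.seed(1)
# Assumption: column k of Z holds the 100 normal draws that multiply ft(t + k/n)
Z <- matrix(rnorm(100 * n), nrow = 100, ncol = n)

Xt <- function(t) {
  w <- ft(t + (1:n) / n)   # length-n vector of weights, one per k
  as.vector(Z %*% w)       # sum over k of ft(t + k/n) * Z[, k]
}

x_val <- Xt(-0.5)          # vector of length 100
length(x_val)
```

Because the draws are fixed, calling Xt twice with the same t returns the same vector, which would not hold if rnorm were called inside each function.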

How to make this R code (for loop) more efficient?

I am doing a simulation study and I wrote the following R code. Is there any way to write this code without using two for loops, or to make it more efficient (run faster)?
S = 10000
n = 100
v = c(5,10,50,100)
beta0.mle = matrix(NA,S,length(v)) # creating four S-by-length(v) matrices of NA
beta1.mle = matrix(NA,S,length(v))
beta0.lse = matrix(NA,S,length(v))
beta1.lse = matrix(NA,S,length(v))
for (j in 1:length(v)){
  for (i in 1:S){
    set.seed(i)
    beta0 = 50
    beta1 = 10
    x = rnorm(n)
    e.t = rt(n,v[j])
    y.t = e.t + beta0 + beta1*x
    func1 = function(betas){
      beta0 = betas[1]
      beta1 = betas[2]
      sum = sum(log(1+1/v[j]*(y.t-beta0-beta1*x)^2))
      return((v[j]+1)/2*sum)
    }
    beta0.mle[i,j] = nlm(func1,c(1,1),iterlim = 1000)$estimate[1]
    beta1.mle[i,j] = nlm(func1,c(1,1),iterlim = 1000)$estimate[2]
    beta0.lse[i,j] = lm(y.t~x)$coef[1]
    beta1.lse[i,j] = lm(y.t~x)$coef[2]
  }
}
The function func1 inside the second for loop is passed to nlm (to find the MLE when the errors are t distributed).
I wanted to use the parallel package in R, but I didn't find any useful functions.
A common way to get code to run faster in R is to replace explicit for loops with vectorized functions or the apply family. Additionally, as in any programming language, you should look for places where you are calling expensive functions (such as nlm) more than once with the same arguments, and store the results rather than recomputing them each time.
Here I am starting as you did by defining the parameters. Since beta0 and beta1 are always 50 and 10, I am going to define those here as well.
S <- 10000
n <- 100
v <- c(5,10,50,100)
beta0 <- 50
beta1 <- 10
Next we will define func1 outside the loop to avoid redefining it each time. func1 now has three extra parameters, v, y.t and x, so that it can be called with the new values.
func1 <- function(betas, v, y.t, x){
  beta0 <- betas[1]
  beta1 <- betas[2]
  sum <- sum(log(1+1/v*(y.t-beta0-beta1*x)^2))
  return((v+1)/2*sum)
}
Now we actually do the real work. Rather than having nested loops, we use nested apply statements. The outer lapply will make a list with one element for each value of v, and the inner vapply will make a matrix holding the four values you want (beta0.mle, beta1.mle, beta0.lse, beta1.lse) for each of the S replications.
values <- lapply(v, function(j) vapply(1:S, function(s) {
    # This should look familiar, it is taken from your code
    set.seed(s)
    x <- rnorm(n)
    e.t <- rt(n,j)
    y.t <- e.t + beta0 + beta1*x
    # Rather than running `nlm` and `lm` twice, we run each once and store the results
    nlmmod <- nlm(func1,c(1,1), j, y.t, x, iterlim = 1000)
    lmmod <- lm(y.t~x)
    # now we return the four values of interest
    c(beta0.mle = nlmmod$estimate[1],
      beta1.mle = nlmmod$estimate[2],
      beta0.lse = lmmod$coef[1],
      beta1.lse = lmmod$coef[2])
  }, numeric(4)) # this tells `vapply` what to expect out of the function
)
Finally we can reorganize everything into the four matrices.
beta0.mle <- vapply(values, function(x) x["beta0.mle", ], numeric(S))
beta1.mle <- vapply(values, function(x) x["beta1.mle", ], numeric(S))
beta0.lse <- vapply(values, function(x) x["beta0.lse.(Intercept)", ], numeric(S))
beta1.lse <- vapply(values, function(x) x["beta1.lse.x", ], numeric(S))
As a final note, it may be possible to reorganize this to run even faster, depending on why you are using the S index to set the seed. If it is important to know which seed was used to generate your x with rnorm, then this may be the best I can do. However, if you are only doing it to ensure that all of your values of v are tested on the same values of x, then further reorganizing with replicate may produce more of a speed-up.
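The row names "beta0.lse.(Intercept)" and "beta1.lse.x" used above arise because c() joins the name you supply with the name that lm has already attached to each coefficient. A small sketch of this, using made-up data purely for illustration, shows that wrapping the coefficients in unname() gives plain names instead:

```r
set.seed(1)
x <- rnorm(20)
y <- 50 + 10 * x + rt(20, df = 5)   # illustrative data, not the full simulation
fit <- lm(y ~ x)

# Names are pasted together: supplied name + existing coefficient name
named <- c(beta0.lse = fit$coefficients[1],
           beta1.lse = fit$coefficients[2])

# unname() drops the existing names first, so only the supplied names remain
cleaned <- c(beta0.lse = unname(fit$coefficients[1]),
             beta1.lse = unname(fit$coefficients[2]))

names(named)
names(cleaned)
```

If the unname() form were used inside the vapply call, the final extraction step would need to index by "beta0.lse" and "beta1.lse" instead.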

Storing results of loop iterations in R

I am trying to store the results of the code below, but I could only come up with a solution that saves the results of the model with the smallest sum of squared residuals. This was useful until the results fell at the limits of the ranges of both c and gamma, so I now need to assess the characteristics of other points. For this I need to store the results of every iteration. Does anyone know how to do this in this case?
Thanks in advance!
dlpib1 <- info$dlpib1
scale <- sqrt(var(dlpib1))
RSS.m <- 10
for (c in seq(-0.03,0.05,0.001)){
  for (gamma in seq(1,100,0.2))
  {
    trans <- (1+exp(-(gamma/scale)*(dlpib1-c)))^-1
    grid.regre <- lm(dlpib ~ dlpib1 + dlpib8 + trans + trans*dlpib1 +
                     I(trans*dlpib4), data=info)
    coef <- grid.regre$coefficients
    RSS <- sum(grid.regre$residuals^2)
    if (RSS < RSS.m){
      RSS.m <- RSS
      gamma.m <- gamma
      c.m <- c
      coef.m <- coef
    }
  }
}
grid <- c(RSS=RSS.m,gamma=gamma.m,c=c.m,coef.m)
grid
The easiest way to store model results by iterations is in a list:
List = list()
for(i in 1:100)
{
LM = lm(rnorm(10)~rnorm(10))
List[[length(List)+1]] = LM
}
You can probably avoid the for loop altogether. However, as for how to accomplish your task, you simply need to index whatever object you are storing the value in. For example,
# outside the for loop
trans <- list()
# inside the for loop
trans[[paste(gamma, c, sep="_")]] <- ...
I'm pretty sure that to save the RSS and coefficients of every iteration you could do something like this:
dlpib1 <- info$dlpib1
scale <- sqrt(var(dlpib1))
c.grid <- seq(-0.03,0.05,0.001)
gamma.grid <- seq(1,100,0.2)
N <- length(c.grid)*length(gamma.grid) # total number of iterations
RSS.m <- rep(0,N)
coef.m <- vector("list",N)
i <- 1 # R indexes from 1; starting at 0 would silently drop the first result
for (c in c.grid){
  for (gamma in gamma.grid){
    trans <- (1+exp(-(gamma/scale)*(dlpib1-c)))^-1
    grid.regre <- lm(dlpib ~ dlpib1 + dlpib8 + trans + trans*dlpib1 +
                     I(trans*dlpib4), data=info)
    coef.m[[i]] <- grid.regre$coefficients
    RSS.m[i] <- sum(grid.regre$residuals^2)
    i <- i+1
  }
}
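Another way to keep every iteration is to lay the whole grid out as a data frame first and attach one RSS per row, so that each result stays next to the c and gamma that produced it. The sketch below uses simulated stand-in data and a reduced model (only dlpib1 and trans) with a coarse grid, all of which are assumptions made so that it runs on its own.

```r
set.seed(42)
# Hypothetical stand-in for the `info` data frame in the question
info <- data.frame(dlpib1 = rnorm(50, sd = 0.02))
info$dlpib <- 0.3 * info$dlpib1 + rnorm(50, sd = 0.01)
scale <- sd(info$dlpib1)

# One row per (c, gamma) combination; coarse grid chosen for illustration
grid <- expand.grid(c = seq(-0.03, 0.05, 0.02), gamma = seq(1, 100, 25))

# Fit one model per row and store its residual sum of squares
grid$RSS <- mapply(function(c_val, gamma_val) {
  trans <- 1 / (1 + exp(-(gamma_val / scale) * (info$dlpib1 - c_val)))
  fit <- lm(dlpib ~ dlpib1 + trans, data = cbind(info, trans = trans))
  sum(residuals(fit)^2)
}, grid$c, grid$gamma)

# The best point is still easy to pick out, and all others remain available
best <- grid[which.min(grid$RSS), ]
head(grid[order(grid$RSS), ])
```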

Allocating space for a sparse matrix in R

I construct a large, sparse matrix for which I know the number of non-zero elements in advance. Is it possible in R to allocate space for this matrix, instead of having its space automatically increased every time I add an element? Something like what spalloc does in Matlab.
As a simplified code-example of what I want, consider the construction of the following block-wise diagonal matrix.
library("Matrix")
n = 1000;
p = 14000;
q = 7;
x_i = Matrix(rnorm(n*p), n, p);
x = Matrix(0, n*q, p*q, sparse=TRUE);
for(i in 1:q) {
x[((i-1)*n+1):(i*n),((i-1)*p+1):(i*p)] = x_i;
}
I think this process would be much faster if I could tell R in advance that the matrix will contain n*p*q non-zero elements.
Thanks in advance!
Edit: I now see that for the blockwise matrix I should use bdiag()
library("Matrix")
n = 1000;
p = 14000;
q = 7;
x_i = Matrix(rnorm(n*p), n, p);
lst = list();
for(i in 1:q) {
  lst[[i]] = x_i;  # double brackets store the whole matrix as one list element
}
x = bdiag(lst);
This is much faster.
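For reference, here is a scaled-down sketch of the bdiag() construction, together with an alternative that passes all (row, column, value) triplets to sparseMatrix() in a single call, which is about as close as the Matrix package gets to allocating everything up front. The small sizes and the use of a base matrix for x_i are choices made here so the example runs quickly.

```r
library(Matrix)

n <- 4; p <- 3; q <- 2            # tiny sizes for illustration
set.seed(1)
x_i <- matrix(rnorm(n * p), n, p)

# Block-diagonal matrix with q copies of x_i
x <- bdiag(replicate(q, x_i, simplify = FALSE))

# Triplet alternative: index vectors for one block, shifted for each block
rows <- rep(seq_len(n), times = p)   # row of each entry of x_i, column-major
cols <- rep(seq_len(p), each = n)    # column of each entry of x_i
i_all <- unlist(lapply(0:(q - 1), function(b) rows + b * n))
j_all <- unlist(lapply(0:(q - 1), function(b) cols + b * p))
x_trip <- sparseMatrix(i = i_all, j = j_all,
                       x = rep(as.vector(x_i), q),
                       dims = c(n * q, p * q))

dim(x)
nnzero(x)
```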

How to speed this kind of double for-loop?

I am programming an expectation-maximization algorithm with R. In order to speed-up the computation, I would like to vectorize this bottleneck. I know that N is about a hundred times k.
MyLoglik = 0
for (i in c(1:N))
{
  for (j in c(1:k))
  {
    MyLoglik = MyLoglik + MyTau[i,j]*log(MyP[j]*MyF(MyD[i,], MyMu[j,], MyS[[j]]))
  }
}
There is also this list of matrices:
MyDf.list <- vector("list", k)
for(i in 1:k)
{
  MyDf.list[[i]] <- matrix(0,d,d)
  for (j in c(1:N))
  {
    MyDf.list[[i]] = MyDf.list[[i]] + MyTau[j,i]*as.numeric((MyD[j,]-MyMu[i,])) %*% t(as.numeric(MyD[j,]-MyMu[i,]))
  }
  MyDf.list[[i]] = MyDf.list[[i]] / MyM[i]
}
I have sped things up a bit using:
MyLoglik = 0
for (j in c(1:k))
{
  MyR = apply(MyD, 1, function(x) log(MyP[j]*MyF(x, MyMu[j,], MyS[[j]])))
  MyLoglik = MyLoglik + sum(MyTau[,j]*MyR)
}
and:
d = dim(MyD)[2]
MyDf.list <- vector("list", k)
for(i in 1:k)
{
  MyDf.list[[i]] <- matrix(0,d,d)
  MyR = apply(MyD, 1, function(x) as.numeric((x-MyMu[i,])) %*% t(as.numeric(x-MyMu[i,])))
  MyDf.list[[i]] = matrix(rowSums(t(MyTau[,i]*t(MyR))) / MyM[i],d,d)
}
For the first one, I'm assuming MyF is a function you've made? If you can make sure it will take your matrices and lists as inputs and output a matrix, you could do something like:
MyLoglik = sum(MyTau%*%log(MyP)) + sum(MyTau*log(MyF(MyD, MyMu, MyS)))
For the second one, I think it will be more difficult to vectorize because you're building a list. Maybe instead of a list of matrices you could have a 3-dimensional array? Since the list holds k matrices of size d by d, MyDf.array would have dimensions k, d, d (or d, d, k).
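To make both suggestions concrete, here is a sketch on simulated data. It assumes that MyF is a multivariate normal density and that MyM holds the column sums of MyTau; neither is stated in the question, and dens_rows is a hypothetical stand-in for a version of MyF that accepts a whole matrix. The inner loop over N disappears in both computations, while the short loop over k remains.

```r
set.seed(1)
N <- 200; k <- 2; d <- 3
MyD   <- matrix(rnorm(N * d), N, d)
MyMu  <- matrix(rnorm(k * d), k, d)
MyS   <- lapply(seq_len(k), function(j) diag(d) * j)  # simple covariances
MyP   <- rep(1 / k, k)
MyTau <- matrix(runif(N * k), N, k)
MyTau <- MyTau / rowSums(MyTau)
MyM   <- colSums(MyTau)            # assumption: component weights sum over rows

# Assumed Gaussian density, evaluated for every row of X at once
dens_rows <- function(X, mu, S) {
  exp(-0.5 * mahalanobis(X, mu, S)) / sqrt((2 * pi)^length(mu) * det(S))
}

# Log-likelihood: N-by-k matrix of densities, then a single weighted sum
dens <- sapply(seq_len(k), function(j) dens_rows(MyD, MyMu[j, ], MyS[[j]]))
MyLoglik <- sum(MyTau * log(sweep(dens, 2, MyP, "*")))

# Weighted scatter matrices: sum_j tau[j, i] * (x_j - mu_i)(x_j - mu_i)^T / MyM[i]
MyDf.list <- lapply(seq_len(k), function(i) {
  centered <- sweep(MyD, 2, MyMu[i, ])              # subtract MyMu[i, ] from each row
  crossprod(centered * MyTau[, i], centered) / MyM[i]
})
```

crossprod(A, B) computes t(A) %*% B, so scaling the rows of the centered data by MyTau[, i] before the product gives the same weighted sum of outer products as the inner loop.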
I hate to even suggest this prematurely, but this is the sort of thing where building a C extension for R might make sense. For matrices with a known size (which you have here!), C extensions aren't that hard to build, I promise! The nastiest bit here would probably be passing in MyF.
My R knowledge is quite out of date, but for loops (especially ones like this!) used to be brutal.
Maybe timing the code and figuring out which part is slow would help? Is it MyF? What happens if you replace it with an identity function?
You can cut down on the work done in the inner loop if things are symmetric: A[i,j] = A[j,i]
