Assume following simple bootstrap procedure:
x <- c(20,54,18,65,87,49,45,94,22,15,16,15,84,55,44,13,16,65,48,98,74,56,97,11,25,43,32,74,45,19,56,874,3,56,89,12,28,71,93)
n <- length(x)
nBoot <- 5; mn <- numeric(nBoot)
for(boots in 1:nBoot){
set.seed(830219+boots)
repl <- sample(x,n,replace=TRUE)
mn[boots] <- mean(repl)
}
Is there a way that I can view the resampled dataset 'repl' for each of the 5 replications?
I would appreciate an answer very much. Many thanks in advance
EDIT
I tried the following:
x <- c(20,54,18,65,87,49,45,94,22,15,16,15,84,55,44,13,16,65,48,98,74,56,97,11,25,43,32,74,45, 19,56,874,3,56,89,12,28,71,93)
n <- length(x)
nBoot <- 5; mn <- numeric(nBoot)
for(boots in 1:nBoot){
set.seed(830219+boots)
repl <- sample(x,n,replace=TRUE)
print(repl)
mn[boots] <- mean(repl)
}
This allows me to view each of the 5 resampled datasets, but does not allow me to work with each dataset seperately as repl[1], repl[2],...
EDIT2
I tried following:
x <- c(20,54,18,65,87,49,45,94,22,15,16,15,84,55,44,13,16,65,48,98,74,56,97,11,25,43,32,74,45,19,56,874,3,56,89,12,28,71,93)
n <- length(x)
nBoot <-3; mn <- numeric(nBoot); repl <- x
for(boots in 1:nBoot){
set.seed(830219+boots)
repl[boots] <- sample(x, n, replace=TRUE)
pr <- print(repl)
mn[boots] <- mean(repl)
}
I then however get 5 warning messages: 'In repl[boots] <- sample(x, n, replace = TRUE) :
number of items to replace is not a multiple of replacement length'
and calling repl[1] only gives me one number
Based on your comments, I've fixed the code. Here's the version that I tested and it seems to work:
x <- c(20,54,18,65,87,49,45,94,22,15,16,15,84,55,44,13,16,65,48,98,74,56,97,11,25,43,32,74,45,19,56,874,3,56,89,12,28,71,93)
n <- length(x)
nBoot <-3; mn <- numeric(nBoot)
repl <- matrix(x, nrow=nBoot, ncol=length(x))
for (boots in 1:nBoot) {
set.seed(830219+boots)
repl[boots, ] <- sample(x, n, replace=TRUE)
pr <- print(repl)
mn[boots] <- mean(repl)
}
Related
Title's a little rough, open to suggestions to improve.
I'm trying to calculate time-average covariances for a 500 length vector.
This is the equation we're using
The result I'm hoping for is a vector with an entry for k from 0 to 500 (0 would just be the variance of the whole set).
I've started with something like this, but I know I'll need to reference the gap (i) in the first mean comparison as well:
x <- rnorm(500)
xMean <-mean(x)
i <- seq(1, 500)
dfGam <- data.frame(i)
dfGam$gamma <- (1/(500-dfGam$i))*(sum((x-xMean)*(x[-dfGam$i]-xMean)))
Is it possible to do this using vector math or will I need to use some sort of for loop?
Here's the for loop that I've come up with for the solution:
gamma_func <- function(input_vec) {
output_vec <- c()
input_mean <- mean(input_vec)
iter <- seq(1, length(input_vec)-1)
for(val in iter){
iter2 <- seq((val+1), length(input_vec))
gamma_sum <- 0
for(val2 in iter2){
gamma_sum <- gamma_sum + (input_vec[val2]-input_mean)*(input_vec[val2-val]-input_mean)
}
output_vec[val] <- (1/length(iter2))*gamma_sum
}
return(output_vec)
}
Thanks
Using data.table, mostly for the shift function to make x_{t - k}, you can do this:
library(data.table)
gammabar <- function(k, x){
xbar <- mean(x)
n <- length(x)
df <- data.table(xt = x, xtk = shift(x, k))[!is.na(xtk)]
df[, sum((xt - xbar)*(xtk - xbar))/n]
}
gammabar(k = 10, x)
# [1] -0.1553118
The filter [!is.na(xtk)] starts the sum at t = k + 1, because xtk will be NA for the first k indices due to being shifted by k.
Reproducible x
x <- c(0.376972124936433, 0.301548373935665, -1.0980231706536, -1.13040590360378,
-2.79653431987176, 0.720573498411587, 0.93912102300901, -0.229377746707471,
1.75913134696347, 0.117366786802848, -0.853122822287008, 0.909259181618213,
1.19637295955276, -0.371583903741348, -0.123260233287436, 1.80004311672545,
1.70399587729432, -3.03876460529759, -2.28897494991878, 0.0583034949929225,
2.17436525195634, 1.09818265352131, 0.318220322390854, -0.0731475581637693,
0.834268741278827, 0.198750636733429, 1.29784138432631, 0.936718306241348,
-0.147433193833294, 0.110431994640128, -0.812504663900505, -0.743702167768748,
1.09534507180741, 2.43537370755095, 0.38811846676708, 0.290627670295127,
-0.285598287083935, 0.0760147178373681, -0.560298603759627, 0.447188372143361,
0.908501134499943, -0.505059597708343, -0.301004012157305, -0.726035976548133,
-1.18007702699501, 0.253074712637114, -0.370711296884049, 0.0221795637601637,
0.660044122429767, 0.48879363533552)
I have a numeric data.frame df with 134946 rows x 1938 columns.
99.82% of the data are NA.
For each pair of (distinct) columns "P1" and "P2", I need to find which rows have non-NA values for both and then do some operations on those rows (linear model).
I wrote a script that does this, but it seems quite slow.
This post seems to discuss a related task, but I can't immediately see if or how it can be adapted to my case.
Borrowing the example from that post:
set.seed(54321)
nr = 1000;
nc = 900;
dat = matrix(runif(nr*nc), nrow=nr)
rownames(dat) = paste(1:nr)
colnames(dat) = paste("time", 1:nc)
dat[sample(nr*nc, nr*nc*0.9)] = NA
df <- as.data.frame(dat)
df_ps <- names(df)
N_ps <- length(df_ps)
My script is:
tic = proc.time()
out <- do.call(rbind,sapply(1:(N_ps-1), function(i) {
if (i/10 == floor(i/10)) {
cat("\ni = ",i,"\n")
toc = proc.time();
show(toc-tic);
}
do.call(rbind,sapply((i+1):N_ps, function(j) {
w <- which(complete.cases(df[,i],df[,j]))
N <- length(w)
if (N >= 5) {
xw <- df[w,i]
yw <- df[w,j]
if ((diff(range(xw)) != 0) & (diff(range(yw)) != 0)) {
s <- summary(lm(yw~xw))
o <- c(i,j,N,s$adj.r.squared,s$coefficients[2],s$coefficients[4],s$coefficients[8],s$coefficients[1],s$coefficients[3],s$coefficients[7])} else {
o <- c(i,j,N,rep(NA,7))
}
} else {o <- NULL}
return(o)
},simplify=F))
}
,simplify=F))
toc = proc.time();
show(toc-tic);
This takes about 10 minutes on my machine.
You can imagine what happens when I need to handle a much larger (although more sparse) data matrix. I never managed to finish the calculation.
Question: do you think this could be done more efficiently?
The thing is I don't know which operations take more time (subsetting of df, in which case I would remove duplications of that? appending matrix data, in which case I would create a flat vector and then convert it to matrix at the end? ...).
Thanks!
EDIT following up from minem's post
As shown by minem, the speed of this calculation strongly depended on the way linear regression parameters were calculated. Therefore changing that part was the single most important thing to do.
My own further trials showed that: 1) it was essential to use sapply in combination with do.call(rbind, rather than any flat vector, to store the data (I am still not sure why - I might make a separate post about this); 2) on the original matrix I am working on, much more sparse and with a much larger nrows/ncolumns ratio than the one in this example, using the information on the x vector available at the start of each i iteration to reduce the y vector at the start of each j iteration increased the speed by several orders of magnitude, even compared with minem's original script, which was already much better than mine above.
I suppose the advantage comes from filtering out many rows a priori, thus avoiding costly xna & yna operations on very long vectors.
The modified script is the following:
set.seed(54321)
nr = 1000;
nc = 900;
dat = matrix(runif(nr*nc), nrow = nr)
rownames(dat) = paste(1:nr)
colnames(dat) = paste("time", 1:nc)
dat[sample(nr*nc, nr*nc*0.90)] = NA
df <- as.data.frame(dat)
df_ps <- names(df)
N_ps <- length(df_ps)
tic = proc.time()
naIds <- lapply(df, function(x) !is.na(x))
dl <- as.list(df)
rl <- sapply(1:(N_ps - 1), function(i) {
if ((i-1)/10 == floor((i-1)/10)) {
cat("\ni = ",i,"\n")
toc = proc.time();
show(toc-tic);
}
x <- dl[[i]]
xna <- which(naIds[[i]])
rl2 <- sapply((i + 1):N_ps, function(j) {
y <- dl[[j]][xna]
yna <- which(naIds[[j]][xna])
w <- xna[yna]
N <- length(w)
if (N >= 5) {
xw <- x[w]
yw <- y[yna]
if ((min(xw) != max(xw)) && (min(yw) != max(yw))) {
# extracts from lm/lm.fit/summary.lm functions
X <- cbind(1L, xw)
m <- .lm.fit(X, yw)
# calculate adj.r.squared
fitted <- yw - m$residuals
rss <- sum(m$residuals^2)
mss <- sum((fitted - mean(fitted))^2)
n <- length(m$residuals)
rdf <- n - m$rank
# rdf <- df.residual
r.squared <- mss/(mss + rss)
adj.r.squared <- 1 - (1 - r.squared) * ((n - 1L)/rdf)
# calculate se & pvals
p1 <- 1L:m$rank
Qr <- m$qr
R <- chol2inv(Qr[p1, p1, drop = FALSE])
resvar <- rss/rdf
se <- sqrt(diag(R) * resvar)
est <- m$coefficients[m$pivot[p1]]
tval <- est/se
pvals <- 2 * pt(abs(tval), rdf, lower.tail = FALSE)
res <- c(m$coefficients[2], se[2], pvals[2],
m$coefficients[1], se[1], pvals[1])
o <- c(i, j, N, adj.r.squared, res)
} else {
o <- c(i,j,N,rep(NA,7))
}
} else {o <- NULL}
return(o)
}, simplify = F)
do.call(rbind, rl2)
}, simplify = F)
out2 <- do.call(rbind, rl)
toc = proc.time();
show(toc - tic)
E.g. try with nr=100000; nc=100.
I should probably mention that I tried using indices, i.e.:
naIds <- lapply(df, function(x) which(!is.na(x)))
and then obviously generating w by intersection:
w <- intersect(xna,yna)
N <- length(w)
This however is slower than the above.
Larges bottleneck is lm function, because there are lot of checks & additional calculations, that you do not necessarily need. So I extracted only the needed parts.
I got this to run in +/- 18 seconds.
set.seed(54321)
nr = 1000;
nc = 900;
dat = matrix(runif(nr*nc), nrow = nr)
rownames(dat) = paste(1:nr)
colnames(dat) = paste("time", 1:nc)
dat[sample(nr*nc, nr*nc*0.9)] = NA
df <- as.data.frame(dat)
df_ps <- names(df)
N_ps <- length(df_ps)
tic = proc.time()
naIds <- lapply(df, function(x) !is.na(x)) # outside loop
dl <- as.list(df) # sub-setting list elements is faster that columns
rl <- sapply(1:(N_ps - 1), function(i) {
x <- dl[[i]]
xna <- naIds[[i]] # relevant logical vector if not empty elements
rl2 <- sapply((i + 1):N_ps, function(j) {
y <- dl[[j]]
yna <- naIds[[j]]
w <- xna & yna
N <- sum(w)
if (N >= 5) {
xw <- x[w]
yw <- y[w]
if ((min(xw) != max(xw)) && (min(xw) != max(xw))) { # faster
# extracts from lm/lm.fit/summary.lm functions
X <- cbind(1L, xw)
m <- .lm.fit(X, yw)
# calculate adj.r.squared
fitted <- yw - m$residuals
rss <- sum(m$residuals^2)
mss <- sum((fitted - mean(fitted))^2)
n <- length(m$residuals)
rdf <- n - m$rank
# rdf <- df.residual
r.squared <- mss/(mss + rss)
adj.r.squared <- 1 - (1 - r.squared) * ((n - 1L)/rdf)
# calculate se & pvals
p1 <- 1L:m$rank
Qr <- m$qr
R <- chol2inv(Qr[p1, p1, drop = FALSE])
resvar <- rss/rdf
se <- sqrt(diag(R) * resvar)
est <- m$coefficients[m$pivot[p1]]
tval <- est/se
pvals <- 2 * pt(abs(tval), rdf, lower.tail = FALSE)
res <- c(m$coefficients[2], se[2], pvals[2],
m$coefficients[1], se[1], pvals[1])
o <- c(i, j, N, adj.r.squared, res)
} else {
o <- c(i,j,N,rep(NA,6))
}
} else {o <- NULL}
return(o)
}, simplify = F)
do.call(rbind, rl2)
}, simplify = F)
out2 <- do.call(rbind, rl)
toc = proc.time();
show(toc - tic);
# user system elapsed
# 17.94 0.11 18.44
i'm using this code:
library("partitions")
x <- c(2,4,6)
parts <- listParts(length(x))
out <- rapply(parts, function(ii) x[ii], how="replace")
to calculate list vector of all partition, but i would be like list of partition with k dimension, for example:
k=2
{(2),(4,6)}{(4),(2,6)}{(6),(2,4)}
Maybe there are better ways of doing this but the following does what you want.
library(partitions)
funParts <- function(x, k){
parts <- listParts(length(x))
res <- lapply(parts, function(inx) sapply(inx, function(i) x[i]))
res <- unlist(res, recursive = FALSE)
res <- res[sapply(res, length) <= k]
unique(res)
}
x <- c(2,4,6)
k <- 2
funParts(x, 2)
funParts(x, 1)
funParts(4:10, 3)
I'm trying to create a simulation to calculate the confidence interval for a binomial proportion. So far I have a function that calculates the lower and upper bounds and I have generated and stored the type of data I want (in a matrix, I'm not sure about that).
How can I create a loop that generates samples with different sizes. I'd like to test how the formula performs when calculating the intervals with sample sizes n=10, 11, 12,... up to 100.
My code so far:
## functions that calculate lower and upper bounds
ll <- function(x, cl=0.95) {
n <- length(x)
p.est <- mean(x)
z = abs(qnorm((1-cl)/2))
return((p.est) - z*sqrt(p.est*(1-p.est)/n))
}
ul <- function(x, cl=0.95) {
n <- length(x)
p.est <- mean(x)
z = abs(qnorm((1-cl)/2))
return((p.est) + z*sqrt(p.est*(1-p.est)/n))
}
## my simulation for n=10 and 200 repetitions.
p <- 0.4
n <- 10
rep <- 200
dat <- rbinom(rep*n,1,p)
x <- matrix(dat, ncol=rep)
ll.res <- apply(x, 2, ll)
ul.res <- apply(x, 2, ul)
hits <- ll.res <= p & p <= ul.res
sum(hits==1)/rep
I'm not sure which values do you want to compare between different sample sizes. But I guess wrapping your simulation in a for and using lists to store the results should work:
nrep=200
hits=list()
value=NULL
ll.res = list()
ul.res = list()
ns = c(10:100)
for(i in 1:length(ns)){
p <- 0.4
n <- ns[i]
rep <- 200
dat <- rbinom(rep*n,1,p)
x <- matrix(dat, ncol=nrep)
ll.res[[i]] <- apply(x, 2, ll)
ul.res[[i]] <- apply(x, 2, ul,cl=0.95)
hits[[i]] <- ll.res[[i]] <= p & p <= ul.res[[i]]
value[i] = sum(hits[[i]]==1)/rep
}
basically i want to perform diagonal averaging in R. Below is some code adapted from the simsalabim package to do the diagonal averaging. Only this is slow.
Any suggestions for vectorizing this instead of using sapply?
reconSSA <- function(S,v,group=1){
### S : matrix
### v : vector
N <- length(v)
L <- nrow(S)
K <- N-L+1
XX <- matrix(0,nrow=L,ncol=K)
IND <- row(XX)+col(XX)-1
XX <- matrix(v[row(XX)+col(XX)-1],nrow=L,ncol=K)
XX <- S[,group] %*% t(t(XX) %*% S[,group])
##Diagonal Averaging
.intFun <- function(i,x,ind) mean(x[ind==i])
RC <- sapply(1:N,.intFun,x=XX,ind=IND)
return(RC)
}
For data you could use the following
data(AirPassengers)
v <- AirPassengers
L <- 30
T <- length(v)
K <- T-L+1
x.b <- matrix(nrow=L,ncol=K)
x.b <- matrix(v[row(x.b)+col(x.b)-1],nrow=L,ncol=K)
S <- eigen(x.b %*% t(x.b))[["vectors"]]
out <- reconSSA(S, v, 1:10)
You can speed up the computation by almost 10 times with the help of a very specialized trick with rowsum:
reconSSA_1 <- function(S,v,group=1){
### S : matrix
### v : vector
N <- length(v)
L <- nrow(S)
K <- N-L+1
XX <- matrix(0,nrow=L,ncol=K)
IND <- row(XX)+col(XX)-1
XX <- matrix(v[row(XX)+col(XX)-1],nrow=L,ncol=K)
XX <- S[,group] %*% t(t(XX) %*% S[,group])
##Diagonal Averaging
SUMS <- rowsum.default(c(XX), c(IND))
counts <- if(L <= K) c(1:L, rep(L, K-L-1), L:1)
else c(1:K, rep(K, L-K-1), K:1)
c(SUMS/counts)
}
all.equal(reconSSA(S, v, 1:10), reconSSA_1(S, v, 1:10))
[1] TRUE
library(rbenchmark)
benchmark(SSA = reconSSA(S, v, 1:10),
SSA_1 = reconSSA_1(S, v, 1:10),
columns = c( "test", "elapsed", "relative"),
order = "relative")
test elapsed relative
2 SSA_1 0.23 1.0000
1 SSA 2.08 9.0435
[Update: As Joshua suggested it could be speed up even further by using the crux of the rowsum code:
reconSSA_2 <- function(S,v,group=1){
### S : matrix
### v : vector
N <- length(v)
L <- nrow(S)
K <- N-L+1
XX <- matrix(0,nrow=L,ncol=K)
IND <- c(row(XX)+col(XX)-1L)
XX <- matrix(v[row(XX)+col(XX)-1],nrow=L,ncol=K)
XX <- c(S[,group] %*% t(t(XX) %*% S[,group]))
##Diagonal Averaging
SUMS <- .Call("Rrowsum_matrix", XX, 1L, IND, 1:N,
TRUE, PACKAGE = "base")
counts <- if(L <= K) c(1:L, rep(L, K-L-1), L:1)
else c(1:K, rep(K, L-K-1), K:1)
c(SUMS/counts)
}
test elapsed relative
3 SSA_2 0.156 1.000000
2 SSA_1 0.559 3.583333
1 SSA 5.389 34.544872
A speedup of x34.5 comparing to original code!!
]
I can't get your example to produce sensible results. I think there are several errors in your function.
XX is used in sapply, but is not defined in the function
sapply works over 1:N, where N=144 in your example, but x.b only has 115 columns
reconSSA simply returns x
Regardless, I think you want:
data(AirPassengers)
x <- AirPassengers
rowMeans(embed(x,30))
UPDATE: I've re-worked and profiled the function. Most of the time is spent in mean, so it may be hard to get this much faster using R code. That said, you can 20% speedup by using sum instead.
reconSSA <- function(S,v,group=1){
N <- length(v)
L <- nrow(S)
K <- N-L+1
XX <- matrix(0,nrow=L,ncol=K)
IND <- row(XX)+col(XX)-1
XX <- matrix(v[row(XX)+col(XX)-1],nrow=L,ncol=K)
XX <- S[,group] %*% t(t(XX) %*% S[,group])
##Diagonal Averaging
.intFun <- function(i,x,ind) {
I <- ind==i
sum(x[I])/sum(I)
}
RC <- sapply(1:N,.intFun,x=XX,ind=IND)
return(RC)
}