I would like to perform bootstrapping to obtain 95% CIs for the fixed effects in a binomial GLMM:
m <- glmer(cbind(df$Valid.detections, df$Missed.detections) ~ distance +
             Habitat + Replicate + transmitter.depth + receiver.depth +
             wind.speed + wtc + Transmitter + (1 | Unit) + (1 | SUR.ID) +
             distance:Transmitter + distance:Habitat +
             distance:transmitter.depth + distance:receiver.depth +
             distance:wind.speed,
           data = df, family = binomial(link = "logit"),
           control = glmerControl(calc.derivs = FALSE))
I found that the confint() function can do this, so I specified the following call:
confint(m, method = "boot", boot.type = "basic", seed = 123, nsim = 1000)
The function had been running for more than 8 hours before I decided to terminate it. Upon termination, it returned the following warning message (repeated 10 times):
Warning messages:
1: In (function (fn, par, lower = rep.int(-Inf, n), upper = rep.int(Inf, :
failure to converge in 10000 evaluations
My questions are: 1) Do I have to worry about these warning messages? If so, how can I deal with them? 2) Because it was still running after 8 hours, I have no idea how long this function takes. It would therefore be nice to have some sort of progress bar while it runs. I read that confint() can pass additional arguments through to bootMer, so I included the argument .progress = "txt", resulting in:
confint(m, method = "boot", boot.type = "basic", seed = 123, nsim = 1000, .progress = "txt")
but it doesn't seem to work. Alternatively, if there are better ways to achieve the same goal, I'm open to suggestions.
Thanks for any help.
Here's an example:
library("lme4")
(t1 <- system.time(
  gm1 <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
               data = cbpp, family = binomial)))
## user system elapsed
## 0.188 0.000 0.186
nranpars <- length(getME(gm1,"theta"))
nfixpars <- length(fixef(gm1))
(t2 <- system.time(c1 <- confint(gm1, method = "boot", nsim = 1000,
                                 parm = (nranpars + 1):(nranpars + nfixpars),
                                 .progress = "txt")))
## user system elapsed
## 221.958 0.164 222.187
Note that this progress bar is only 80 characters long, so it increments only about once every 12 bootstrap iterations (1000/80 = 12.5). If your original model took an hour to fit, you shouldn't expect to see any progress-bar activity until about 12 hours later ...
(t3 <- system.time(c2 <- confint(gm1,
                                 parm = (nranpars + 1):(nranpars + nfixpars))))
## user system elapsed
## 5.212 0.012 5.236
1000 bootstrap reps is probably overkill -- if your model fit is slow, you can probably get reasonable CIs from 200 bootstrap reps.
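For example, applied to the model m from the question, a quicker 200-replicate run would look something like the following sketch (untested on your data; all arguments are standard confint.merMod/bootMer arguments):

## sketch only: fewer bootstrap replicates on the original model `m`
c_small <- confint(m, method = "boot", boot.type = "basic",
                   nsim = 200, seed = 123, .progress = "txt")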
I tried this with optimizer="nloptwrap" as well, hoping it would speed things up. It did, although there is a drawback (see below).
(t4 <- system.time(
  gm1B <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
                data = cbpp, family = binomial,
                control = glmerControl(optimizer = "nloptwrap"))))
## user system elapsed
## 0.064 0.008 0.075
(t5 <- system.time(c3 <- confint(gm1B, method = "boot", nsim = 1000,
                                 parm = (nranpars + 1):(nranpars + nfixpars),
                                 .progress = "txt", PBargs = list(style = 3))))
##
## user system elapsed
## 65.264 2.160 67.504
This is much faster, but gives warnings (37 in this case) about convergence. According to all.equal(), there was only about a 2% difference between the confidence intervals calculated this way and those calculated earlier with the default optimizer. (There are still some wrinkles to sort out in the package itself ...)
Your best bet for speeding this up will be to parallelize -- unfortunately, this way you lose the ability to use a progress bar ...
(t6 <- system.time(c4 <- confint(gm1, method = "boot", nsim = 1000,
                                 parm = (nranpars + 1):(nranpars + nfixpars),
                                 parallel = "multicore", ncpus = 4)))
##
## user system elapsed
## 310.355 0.916 116.917
This takes more user time (it counts the time used on all cores), but the elapsed time is cut in half. (It would be nice to do better with 4 cores, but twice as fast is still good. These are virtual cores on a virtual Linux machine; real (non-virtual) cores might give better performance.)
With nloptwrap and multicore combined I can get the time down to 91 seconds (user)/ 36 seconds (elapsed).
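Something like the following sketch combines the two speedups (adjust ncpus to your machine; as noted above, no progress bar is available in parallel mode):

## sketch: refit with the faster optimizer, then bootstrap on 4 cores
gm1C <- update(gm1, control = glmerControl(optimizer = "nloptwrap"))
(t7 <- system.time(c5 <- confint(gm1C, method = "boot", nsim = 1000,
                                 parm = (nranpars + 1):(nranpars + nfixpars),
                                 parallel = "multicore", ncpus = 4)))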
y <- cumsum(rnorm(100,0,1)) # random walk: cumulative sum of standard normal increments
y.ts <- ts(y)
x <- cumsum(rnorm(100,0,1))
x
x.ts <- ts(x)
ts.plot(y.ts, x.ts) # plot the two random walks
Regression.Q1 <- lm(y ~ x)
summary(Regression.Q1)
t.test1 <- (summary(Regression.Q1)$coef[2,3]) # T-test computation
y[t] = y[t-1] + epsilon[t]
epsilon[t] ~ N(0,1)
set.seed(1)
t=1000
epsilon=sample(c(-1,1), t, replace = 1) # Generate k random walks across time {0, 1, ... , T}
N=T=1e3
y=t(apply(matrix(sample(c(-1,1),N*T,rep=TRUE),ncol=T),1,cumsum))
y[1]<-0
for (i in 2:t) {
y[i]<-y[i-1]+epsilon[i]
}
I need to:
Repeat the process 1,000 times (a Monte Carlo simulation), i.e. build a loop around the previous program and save the t statistic each time. You will have a sequence of 1,000 t-tests: S = (t-test1, t-test2, ..., t-test1000). Count the number of times the absolute value of the 1,000 t-statistics exceeds 1.96, the critical value at a 5% significance level. If the series were I(0) you would find roughly 5%; that won't be the case here (spurious regression).
What do I need to add to save the respective coefficients?
Your posted code related to y[t] = y[t-1] + epsilon[t] is not real working code, but I can see that you are trying to store all 1000 * 2 random walks. There is actually no need to do this: we only care about the t-scores, not the realizations of the random walks themselves.
For this kind of problem, where we aim to replicate a procedure many times, it is handy to first write a function that executes the procedure once. You already had good working code for this; we just need to wrap it in a function (removing unnecessary parts such as the plotting):
sim <- function () {
  y <- cumsum(rnorm(100,0,1))
  x <- cumsum(rnorm(100,0,1))
  coef(summary(lm(y ~ x)))[2,3]
}
This function takes no input; it only returns the t-score for one experiment.
Now, we are going to repeat this 1000 times. We could write a for loop, but the function replicate is easier (read ?replicate if necessary):
S <- replicate(1000, sim())
Note that this will take some time, much slower than it needs to be for such a simple task, because both lm and summary.lm are slow. A much faster way is shown later.
Now S is a vector of 1000 values: the "sequence of 1,000 t-tests" you want. To get "the number of times the absolute value of the 1,000 t-tests > 1.96", we can just use:
sum(abs(S) > 1.96)
# [1] 756
The result 756 is just what I got; you will get something different, as the simulation is random, but it will always be quite a large number, as expected.
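If you prefer the answer as a proportion, to compare directly with the nominal 5%, divide by the number of replicates, or equivalently:

mean(abs(S) > 1.96)
# about 0.76 for the run above, versus the nominal 0.05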
A faster version of sim:
fast_sim <- function () {
  y <- cumsum(rnorm(100,0,1))
  x <- cumsum(rnorm(100,0,1))
  y <- y - mean(y)
  x <- x - mean(x)
  xty <- crossprod(x,y)[1]
  xtx <- crossprod(x)[1]
  b <- xty / xtx
  sigma <- sqrt(sum((y - x * b) ^ 2) / 98)  # residual SD on 100 - 2 = 98 df
  b * sqrt(xtx) / sigma                     # t-score = b / se(b), with se(b) = sigma / sqrt(xtx)
}
This function computes simple linear regression without lm, and t-score without summary.lm.
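As a quick sanity check (my own addition, not part of the original answer), the hand-computed t-score matches what summary.lm reports on the same data:

## compare the manual t-score with summary(lm(...)) on one simulated pair
set.seed(42)
y <- cumsum(rnorm(100, 0, 1))
x <- cumsum(rnorm(100, 0, 1))
yc <- y - mean(y); xc <- x - mean(x)
b <- crossprod(xc, yc)[1] / crossprod(xc)[1]
sigma <- sqrt(sum((yc - xc * b)^2) / 98)
c(manual = b * sqrt(crossprod(xc)[1]) / sigma,
  lm = coef(summary(lm(y ~ x)))[2, 3])
# the two numbers should match (up to floating-point error)

With that confirmed, run the fast version 1000 times: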
S <- replicate(1000, fast_sim())
sum(abs(S) > 1.96)
# [1] 778
An alternative way is to use cor.test:
fast_sim2 <- function () {
  y <- cumsum(rnorm(100,0,1))
  x <- cumsum(rnorm(100,0,1))
  unname(cor.test(x, y)[[1]])
}
S <- replicate(1000, fast_sim2())
sum(abs(S) > 1.96)
# [1] 775
Let's have a benchmark:
system.time(replicate(1000, sim()))
# user system elapsed
# 1.860 0.004 1.867
system.time(replicate(1000, fast_sim()))
# user system elapsed
# 0.088 0.000 0.090
system.time(replicate(1000, fast_sim2()))
# user system elapsed
# 0.308 0.004 0.312
cor.test is much faster than lm + summary.lm, but manual computation is even faster!
I have a dataframe with the following dimensions:
dim(b)
[1] 974 433685
The columns represent variables that I want to run ANOVAs on (i.e., I want to run 433,685 ANOVAs). Sample size is 974. The last column is the 'group' variable.
I've come up with 3 different methods, but all are too slow due to the number of tests.
First, let's generate a small practice dataset to play with:
dat = as.data.frame(matrix(runif(10000*500), ncol = 10000, nrow = 500))
dat$group = rep(letters[1:10], 50)
Method 1 (based on 'sapply'):
system.time(sapply(dat[,-length(dat)], function(x) aov(x~group, data=dat) ))
user system elapsed
143.76 0.33 151.79
Method 2 (based on 'mclapply' from the 'parallel' package):
library(parallel)
options(mc.cores=3)
system.time(mclapply(dat[,-length(dat)], function(x) aov(x~group, data=dat) ))
user system elapsed
141.76 0.21 142.58
Method 3 (based on 'cbind'-ing the LHS):
formula = as.formula( paste0("cbind(", paste(names(dat)[-length(dat)],collapse=","), ")~group") )
system.time(aov(formula, data=dat))
user system elapsed
10.00 0.22 10.25
In the practice dataset, Method 3 is a clear winner. However, when I do this on my actual data, computing on just 10 (of 433,685) columns using Method 3 takes this long:
user system elapsed
119.028 5.430 124.414
Not sure why it takes substantially longer on my actual data. I have access to a Linux cluster with upwards of 16 cores and 72GB of RAM.
Is there any way to compute this faster?
For simultaneously fitting many general linear models (such as ANOVA) using the same design matrix, the Bioconductor/R limma package provides a very fast lmFit() function. This is how to fit an ANOVA model using limma:
library(limma)
# generate some data
# (same dimensions as in your question)
nrows <- 1e4
ncols <- 5e2
nlevels <- 10
dat <- matrix(
  runif(nrows * ncols),
  nrow = nrows,
  ncol = ncols
)
group <- factor(rep(
  letters[1:nlevels],
  ncols / nlevels
))
# construct the design matrix
# (same as implicitly used in your question)
dmat <- model.matrix(~ group)
# fit the ANOVA model
fit <- lmFit(dat, dmat)
On my laptop it finished in 0.4 - 0.45 seconds, on data of the same dimensions as the data in your question.
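If you also want per-variable F statistics and p-values from the fit (the ANOVA test itself, not just the coefficient estimates), limma's usual follow-up is eBayes(); a minimal sketch, assuming the objects defined above, and noting that these are limma's moderated statistics rather than the classical ANOVA F:

## moderated F-test for any difference among the 10 groups
## (coefficients 2:nlevels are the non-intercept group effects)
fit <- eBayes(fit)
res <- topTable(fit, coef = 2:nlevels, number = Inf, sort.by = "none")
head(res[, c("F", "P.Value")])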
I need to simulate a stock's daily returns. I am given r = (P(t+1) - P(t)) / P(t), normally distributed with mean µ = 1% and sd σ = 5%. P(t) is the stock price at the end of day t. Simulate 100,000 instances of such daily returns.
Since I am a new R user, how do I set up t for this example? I am assuming P should be set up as:
P <- rnorm(100000, .01, .05)
r=(P(t+1)-P(t))/P(t)
You are getting it wrong: from what you wrote, the mean and the sd apply to the return, not to the price. I furthermore assume that the mean is quoted on an annual basis (a 1% rate of return from one day to the next would be huge!) and that t moves over a range of 252 trading days per year.
With these hypothesis, you can get a series of daily return in R with:
r = rnorm(100000, .01/252, .005)
Assuming the model you mentioned, you can get the series of prices P (containing 100,001 elements; I take P[1] = 100, change it to your own value if needed):
factor = 1 + r
temp = 100
P = c(100, sapply(1:100000, function(u){
  p = factor[u]*temp
  temp <<- p
  p
}))
The return parameters you mention (mean = 0.01 and sd = 0.05) will however lead to an exploding stock price (an unrealistic model and parameters). Be careful to check that prod(factor) does not return Inf.
Here is the result for the first 1000 values of P, representing 4 years:
plot(1:1000, P[1:1000])
One of the classical models (which does not mean this model is realistic) assumes that the observed log returns follow a normal distribution.
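Under that assumption, a price path can be simulated directly; here is a minimal sketch (my addition, reusing the illustrative parameters from above), which keeps prices strictly positive:

## normal daily log returns -> strictly positive prices
logret <- rnorm(100000, mean = 0.01 / 252, sd = 0.005)
P <- 100 * exp(cumsum(logret))
plot(P[1:1000], type = "l")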
Hope this helps.
I see you already have an answer, and ColonelBeauvel might have more domain knowledge than I do (assuming this is business or finance homework). I approached it a bit differently and am going to post a commented transcript. His method uses the <<- operator, which is considered a somewhat suspect strategy in R, although I must admit it seems quite elegant in this application. I suspect my method will be a lot faster if you ever get into doing large-scale simulations.
Starting with your code:
P <- rnorm(100000, .01, .05)
# r=(P(t+1)-P(t))/P(t) definition, not R code
# inference: P_t+1 = r_t*P_t + P_t = P_t*(1+r_t)
# So, all future P's will be determined by P_1 and r_t
Since P_2 will be P_1*(1+r_1), P_3 will be P_1*(1+r_1)*(1+r_2), and so on: a cumulative product of the vector (1+r), for which there is a vectorized function.
P <- P_1*cumprod(1+r)
#Error: object 'P_1' not found
P_1 <- 100
P <- P_1*cumprod(1+r)
#Error: object 'r' not found
# So the random simulation should have been for `r`, not P
r <- rnorm(100000, .01, .05)
P <- P_1*cumprod(1+r)
plot(P)
#Error in plot.window(...) : infinite axis extents [GEPretty(-inf,inf,5)]
str(P)
This occurred because the cumulative product went above the limits of numerical capacity and got assigned to Inf (infinity). Let's be a little more careful:
r <- rnorm(300, .01, .05)
P <- P_1*cumprod(1+r)
plot(P)
The strategy below iteratively updates the price at time t as 'temp' and multiplies it by a single value. It's likely to be a lot slower.
r = rnorm(100000, .01/252, .005)
factor = 1 + r
temp = 100
P = c(100, sapply(1:300, function(u){
  p = factor[u]*temp
  temp <<- p
  p
}))
> system.time( {r <- rnorm(10000, .01/250, .05)
+ P <- P_1*cumprod(1+r)
+ })
user system elapsed
0.001 0.000 0.002
> system.time({r = rnorm(10000, .01/252, .05)
+ factor = 1 + r
+ temp = 100
+ P = c(100, sapply(1:300, function(u){
+ p = factor[u]*temp
+ temp<<-p
+ p
+ }))})
user system elapsed
0.079 0.004 0.101
To simulate the daily log returns of a stock, use the following method:
Consider working with 256 days of daily stock price data.
Load the original data into R.
Create another data.frame for the simulated log returns.
Code:
logr <- data.frame(Date = gati$Date[1:255],
                   Shareprice = gati$Adj.Close[1:255],
                   LogReturn = log(gati$Adj.Close[1:255] / gati$Adj.Close[2:256]))
Here gati is the dataset, Date and Adj.Close are the variables, and the bracketed index values matter: 256 prices yield 255 log returns.
P <- rnorm(100000, .01, .05)
r=(P(t+1)-P(t))/P(t)
The second line translates directly into:
r <- (P[-1] - P[-length(P)]) / P[-length(P)]   # (1:5)[-1] gives 2:5
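A tiny made-up example to see the indexing at work:

p <- c(100, 110, 99)
(p[-1] - p[-length(p)]) / p[-length(p)]
# [1]  0.1 -0.1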
Stock returns are not normally distributed for simple returns ("R"), given their -1 lower bound per compounded period. However, log returns ("r") generally are. The code below is adapted from #42's post above. There don't seem to be any solutions to simulating from a log mean ("expected return") and log stdev ("risk") in #Rstats, so I've included one here for those looking for "Monte Carlo simulation using log expected return and log standard deviation", which are normally distributed and have no lower bound at -1. Note: from this single example, it would require looping thousands of times to simulate a portfolio, i.e. stacking 100k plots like the one below and averaging a single slice to calculate a portfolio's average expected return at a chosen forward month. The below should give a good basis for doing so.
startPrice = 100
forwardPeriods = 12*10 # 10 years * 12 months with Month-over-Month E[r]
factor = exp(rnorm(forwardPeriods, .04, .10)) # Monthly Expected Ln Return = .04 and Expected Monthly Risk = .1
temp = startPrice
P = c(startPrice, sapply(1:forwardPeriods, function(u){p = factor[u]*temp; temp <<- p; p}))
plot(P, type = "b", xlab = "Forward End of Month Prices", ylab = "Expected Price from Log E[r]", ylim = c(0,max(P)))
n <- length(P)
logRet <- log(P[-1]/P[-n])
# Notice, with many samples this nearly matches our initial log E[r] and stdev(r)
mean(logRet)
# [1] 0.04540838
sqrt(var(logRet))
# [1] 0.1055676
If tested with a negative log expected return, the price will not fall below zero. The other examples will return negative prices with negative expected returns. The code I've shared here can be tested to confirm that negative prices do not occur in the simulation.
min(P)
# [1] 100
max(P)
# [1] 23252.67
Horizontal axis is number of days, and vertical axis is price.
n_prices <- 1000
volatility <- 0.2
amplitude <- 10
chng <- amplitude * rnorm(n_prices, 0, volatility)
prices <- cumsum(chng)
plot(prices, type='l')
I would like to run a discrete-time simulation (simplified version below). I generate a data frame of population members (one member per row) with their timestamps for entering and exiting a website. I then wish to count at each time interval how many members are on the site.
Currently I am looping through time and at each second counting how many members have entered and not yet exited. (I have also tried destructive iteration by removing exited members at each interval, which takes even longer. I also understand that I can use larger time intervals in the loop.)
How do I use linear algebra to eliminate the for-loop and excess runtime? My current approach does not scale well as population increases, and of course it is linear with respect to duration.
popSize = 10000
simDuration = 10000
enterTimestamp <- rexp(n = popSize, rate = .001)
exitTimestamp <- enterTimestamp + rexp(n = popSize, rate = .001)
popEvents <- data.frame(cbind(enterTimestamp,exitTimestamp))
visitorLoad <- integer(length = simDuration)
for (i in 1:simDuration) {
  visitorLoad[i] <- sum(popEvents$enterTimestamp <= i &
                        popEvents$exitTimestamp > i)
  if (i %% 100 == 0) {
    print(paste('Sim at', i, 'out of', simDuration, 'seconds.', sep = ' '))
  }
}
plot(visitorLoad, typ = 'l', ylab = 'Visitor Load', xlab='Time Elapsed (sec)')
You can obtain the counts of visitors entering and exiting at different times and then use the cumulative sum to compute the number of visitors there at a particular time. This seems to meet your requirement of the code running quickly, though it does not use linear algebra.
diffs = rep(0, simDuration+1)
# Store the number of times a visitor enters and exits at each timestep. The table
# will contain headers that are the timesteps and values that are the number of
# people entering or exiting at the timestep.
tabEnter = table(pmax(1, ceiling(enterTimestamp)))
tabExit = table(pmin(simDuration+1, ceiling(exitTimestamp)))
# For each time index, add the number of people entering and subtract the number of
# people exiting. For instance, if in period 20, 3 people entered and 4 exited, then
# diffs[20] equals -1. as.numeric(names(tabEnter)) is the periods for which at least
# one person entered, and tabEnter is the number of people in each of those periods.
diffs[as.numeric(names(tabEnter))] = diffs[as.numeric(names(tabEnter))] + tabEnter
diffs[as.numeric(names(tabExit))] = diffs[as.numeric(names(tabExit))] - tabExit
# cumsum() sums the diffs vector through a particular time point.
visitorLoad2 = head(cumsum(diffs), simDuration)
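As a quick check (my addition, assuming no visitor enters after simDuration), this should reproduce the loop's result exactly:

## compare against the original loop result
all.equal(visitorLoad, visitorLoad2)  # should return TRUE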
How about this for simplicity:
vl<-unlist(lapply(1:simDuration,function(i)sum((enterTimestamp<=i)*(exitTimestamp>i))))
plot(vl, typ = 'l', ylab = 'Visitor Load', xlab='Time Elapsed (sec)')
It's twice as fast as the current loop, but if performance is more important then @josilber's solution is better, or maybe something with data.table(); I'll have a think...
EDIT - how about this for speed:
require(data.table)
require(plyr) # for count() function
system.time({
  enter <- data.table(count(ceiling(enterTimestamp)))   # entries grouped by second
  exit <- data.table(count(ceiling(exitTimestamp)))     # exits grouped by second
  sim <- data.table(x = 1:simDuration)                  # time index
  merged <- merge(merge(sim, enter, by = "x", all.x = T), exit, by = "x", all.x = T)
  mat <- data.matrix(merged[, list(freq.x, freq.y)])    # make matrix to remove NAs
  mat[is.na(mat)] <- 0   # remove NAs; there are quicker ways but more complicated
  vl <- cumsum(mat[, 1] - mat[, 2])   # cumsum() to roll up the movements
})
user system elapsed
0.02 0.00 0.02
plot(vl, typ = 'l', ylab = 'Visitor Load', xlab='Time Elapsed (sec)')
** FURTHER EDIT ** - balance of performance and simplicity
system.time(cumsum(data.frame(table(cut(enterTimestamp,0:10000))-table(cut(exitTimestamp,0:10000)))[,2]))
user system elapsed
0.09 0.00 0.10
How can I initialise an unstructured covariance matrix for the following model?
library(nlme)
y <- data.frame(response = c(10,19,27,28,9,13,25,29,4,10,20,18,5,6,12,17),
                treatment = factor(rep(1:4, 4)),
                subject = factor(rep(1:4, each = 4)))
fit <- lme(response ~ -1 + treatment, y, random = ~1 | subject,
           correlation = corSymm(form = ~1 | subject))
I tried some variants, but every time I get the error:
Error in lme.formula(response ~ -1 + treatment, y, random = ~1 | :
nlminb problem, convergence error code = 1
message = function evaluation limit reached without convergence (9)
It's practically difficult to fit an unstructured correlation matrix with 6 parameters in addition to a treatment mean effect (4 parameters), a random-effects variance (1), and a residual variance (1) to a data set with only 16 points. If I try with a larger, randomized version of your data set, it works fine.
nSubj <- 20
respVec <- c(10,19,27,28,9,13,25,29,4,10,20,18,5,6,12,17)
set.seed(101)
y <- data.frame(response = sample(respVec, size = 4*nSubj, replace = TRUE),
                treatment = factor(rep(1:4, nSubj)),
                subject = factor(rep(1:nSubj, each = 4)))
library(nlme)
fit <- lme(response ~ -1 + treatment, y, random = ~1 | subject,
           correlation = corSymm(form = ~1 | subject),
           control = lmeControl(msVerbose = TRUE))
Now we can experiment and see how small a data set we can get away with. Package the stuff above into a test function that simulates data and tries a fit, returning TRUE if the fit fails:
testFun <- function(nSubj) {
  y <- data.frame(response = sample(respVec, size = 4*nSubj, replace = TRUE),
                  treatment = factor(rep(1:4, nSubj)),
                  subject = factor(rep(1:nSubj, each = 4)))
  fit <- try(lme(response ~ -1 + treatment, y, random = ~1 | subject,
                 correlation = corSymm(form = ~1 | subject)), silent = TRUE)
  inherits(fit, "try-error")
}
Try the test function N times and report the proportion of failures:
testFun2 <- function(nSubj, N) {
  mean(replicate(N, testFun(nSubj)))
}
Try it out for a range of numbers of subjects (slow):
set.seed(101)
testRes <- sapply(4:20,testFun2,N=50)
Results:
## [1] 0.64 0.04 0.00 0.00 ... 0.00
Somewhat to my surprise, this works about a third of the time with 4 subjects, 96% of the time with 5 subjects, and always with more than 5 subjects.
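To visualize where fitting starts to succeed reliably, a small plot of the failure proportions (my addition):

## proportion of failed fits as a function of the number of subjects
plot(4:20, testRes, type = "b",
     xlab = "number of subjects",
     ylab = "proportion of failed lme fits")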