How long should a Monte Carlo bootstrap power analysis simulation in R take? Is it potentially hours? (1000 reps, 1000 bootstraps)

I am using a Monte Carlo simulation to run a power analysis for a longitudinal mediation model. I'm using the power.boot function from the bmem package (which builds on lavaan).
I checked the code with only 5 reps/5 bootstraps to make sure it worked, and it did.
Then I ran the code with 1000 reps and 1000 bootstraps, as the package documentation recommends.
It's been over an hour now and it's still running - is this normal? How long is too long?
powermodel1 <-'
x2 ~ start(.6)*x1 + x*x1
x3 ~ start(.6)*x2 + x*x2
m2 ~ start(.15)*x1 + a*x1 + start(.3)*m1 + m*m1
m3 ~ start(.15)*x2 + a*x2 + start(.3)*m2 + m*m2
y2 ~ start(.5)*m1 + b*m2 + start(.3)*y1 + y*y1
y3 ~ start(.5)*m2 + b*m2 + start(.3)*y2 + y*y2 + start(0.05)*x1 + c*x1
x1 ~~ start(.15)*m1
x1 ~~ start(.15)*y1
y1 ~~ start(.5)*m1
'
indirect <- 'ab:=a*b'
N <- 200
system.time(bootstrap <- power.boot(powermodel1, indirect, N, nrep = 1000,
                                    nboot = 1000, parallel = 'multicore'))
summary(bootstrap)

Unfortunately it looks like it will take a while: roughly 8 hours on my system. To estimate that, I timed a few smaller runs and extrapolated:
library(bmem)
powermodel1 <-'
x2 ~ start(.6)*x1 + x*x1
x3 ~ start(.6)*x2 + x*x2
m2 ~ start(.15)*x1 + a*x1 + start(.3)*m1 + m*m1
m3 ~ start(.15)*x2 + a*x2 + start(.3)*m2 + m*m2
y2 ~ start(.5)*m1 + b*m2 + start(.3)*y1 + y*y1
y3 ~ start(.5)*m2 + b*m2 + start(.3)*y2 + y*y2 + start(0.05)*x1 + c*x1
x1 ~~ start(.15)*m1
x1 ~~ start(.15)*y1
y1 ~~ start(.5)*m1
'
indirect <- 'ab:=a*b'
N <- 200
system.time(bootstrap <- bmem::power.boot(powermodel1, indirect, N, nrep = 10, nboot = 10, parallel = 'multicore'))
system.time(bootstrap <- bmem::power.boot(powermodel1, indirect, N, nrep = 30, nboot = 30, parallel = 'multicore'))
system.time(bootstrap <- bmem::power.boot(powermodel1, indirect, N, nrep = 60, nboot = 60, parallel = 'multicore'))
system.time(bootstrap <- bmem::power.boot(powermodel1, indirect, N, nrep = 100, nboot = 100, parallel = 'multicore'))
library(tidyverse)
# Load the times from above into a dataframe
benchmark <- tibble(bootstraps = c(10, 30, 60, 100),
                    times = c(4.021, 30.122, 121.103, 311.236))
# Plot the points and fit a curve
ggplot(benchmark, aes(x = bootstraps, y = times)) +
  geom_point() +
  geom_smooth(se = FALSE, span = 5)
# Fit a model
fit <- lm(times ~ poly(bootstraps, 2, raw = TRUE), data = benchmark)
newtimes <- data.frame(bootstraps = seq(100, 1000, length = 4))
# Predict the time it will take for larger bootstrap/rep values
predict(fit, newdata = newtimes)
>          1          2          3          4
>   311.6829  4568.3812 13789.6754 27975.5655
# Convert the largest prediction from seconds to hours
print(27975.5655/60/60)
> [1] 7.77099
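For convenience, here is a small helper (just a sketch, assuming the quadratic fit object fit from above) that converts the model's predicted seconds into hours for any number of bootstraps:
# Hypothetical helper built on the quadratic fit above
predict_hours <- function(fit, bootstraps) {
  secs <- predict(fit, newdata = data.frame(bootstraps = bootstraps))
  unname(secs) / 3600
}
predict_hours(fit, 1000)
> [1] 7.77099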

Related

Shortening the formula syntax of a regression model

I was wondering whether the syntax of the regression model below could be made more concise (shorter) than it currently is.
dat <- read.csv('https://raw.githubusercontent.com/rnorouzian/v/main/bv1.csv')
library(nlme)
model <- lme(achieve ~ 0 + D1 + D2 +
               D1:time + D2:time +
               D1:schcontext + D2:schcontext +
               D1:female + D2:female +
               D1:I(female*time) + D2:I(female*time) +
               D1:I(schcontext*time) + D2:I(schcontext*time),
             correlation = corSymm(),
             random = ~ 0 + D1:time | schcode/id, data = dat,
             weights = varIdent(form = ~ 1 | factor(math)),
             na.action = na.omit,
             control = lmeControl(maxIter = 200, msMaxIter = 200,
                                  niterEM = 50, msMaxEval = 400))
coef(summary(model))
Focusing on the fixed-effect component only.
Original formula:
form1 <- ~ 0 + D1 + D2 +
  D1:time + D2:time +
  D1:schcontext + D2:schcontext +
  D1:female + D2:female +
  D1:I(female*time) + D2:I(female*time) +
  D1:I(schcontext*time) + D2:I(schcontext*time)
X1 <- model.matrix(form1, data=dat)
I think this is equivalent:
form2 <- ~0 +
D1 + D2 +
(D1+D2):(time + schcontext + female + female:time+schcontext:time)
X2 <- model.matrix(form2, data=dat)
(Unfortunately ~ 0 + (D1 + D2):(1 + time + ...) doesn't work as I would have liked/expected.)
For a start, the model matrix has the right dimensions. Staring at the column names of the model matrices and reordering the columns manually:
X2o <- X2[,c(1:3,6,4,7,5,8,9,11,10,12)]
all.equal(c(X1),c(X2o)) ##TRUE
(For numerical predictors, you don't need I(A*B): A:B is equivalent.)
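A quick way to check that equivalence (a minimal sketch with made-up numeric vectors):
A <- rnorm(5); B <- rnorm(5)
# the interaction column A:B matches the explicit product I(A*B)
all.equal(unname(model.matrix(~ A:B)[, 2]),
          unname(model.matrix(~ I(A*B))[, 2]))
## [1] TRUE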
Actually, you can do a little better using the * operator:
form3 <- ~0 +
D1 + D2 +
(D1+D2):(time*(schcontext+female))
X3 <- model.matrix(form3, data=dat)
X3o <- X3[,c(1:3,6,4,7,5,8,10,12,9,11)]
all.equal(c(X1),c(X3o)) ## TRUE
Compare formula length:
sapply(list(form1, form2, form3),
       function(x) nchar(as.character(x)[[2]]))
## [1] 183 84 54

Visualization of predict glm using multiple variables in R

I would like to use the following dataset to fit a glm and visualize the predictions from predict().
y=c(-18.948,-19.007,-18.899,-19.022,-20.599,-19.778,-17.344,-20.265,-20.258,-19.886,-18.05,-19.824,-20.1,-20.508,-20.455,-16.573,-20.249,-20.205,-20.436,-16.358,-17.717,-19.794,-20.372,-19.944,-20.072,-19.889,-20.139,-19.132,-20.275,-19.953,-19.769,-20.2,-19.638,-17.419,-19.086,-18.347,-18.73,-18.872,-18.956,-19.28,-18.176,-19.036,-18.084,-20.11,-19.641,-19.656,-19.25,-18.68,-19.089,-18.969,-18.161,-17.603,-20.37,-19.233,-18.961,-19.083,-20.118,-19.795,-17.154,-16.75)
x1=c(9.698,9.583,9.356,9.326,9.438,9.733,8.803,8.973,9.141,9.044,8.788,9.377,9.26,10.186,9.035,9.569,9.431,9.09,8.776,9.117,9.393,9.408,9.307,8.868,8.398,8.407,9.364,9.074,8.444,9.122,10.11,7.81,9.777,6.472,9.521,8.92,9.341,9.446,9.08,8.071,8.047,8.019,7.419,9.022,9.981,9.337,9.989,10.013,9.31,10.843,8.337,9.103,6.438,9.372,9.071,8.749,9.016,8.181,9.284,8.44)
x2=c('S03','S03','S03','S03','S03','S03','S03','S03','S03','S03','S03','S03','S03','S03','S03','S03','S03','S03','S03','S03','S04','S04','S04','S04','S04','S04','S06','S06','S06','S06','S06','S06','S06','S06','S07','S07','S07','S07','S07','S07','S07','S07','S07','S08','S08','S09','S09','S09','S09','S09','S09','S09','S10','S03','S03','S03','S04','S04','S07','S07')
x3=c('A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','A1','P1','P1','P1','P1','P1','P1','P1')
mydata <- data.frame(y,x1,x2,x3)
Fit the glm model:
myglm <- glm(y ~ x1+x2+x3+x1:x2, family="gaussian", data= mydata)
Prediction:
1). Extract the range of x1
min <- min(mydata$x1)
max <- max(mydata$x1)
2). Create a new data frame, new.x.
Here comes the question:
How should I include x2 and x3 in new.x?
new.x <- data.frame(
  x1 = seq(min, max, length = 60),
  x2 = ???,
  x3 = ???)
Then predict new.y with myglm:
new.y = predict(myglm, newdata=new.x, se.fit=TRUE)
Combine new.x and new.y:
addThese <- data.frame(new.x, new.y)
Add the confidence interval:
addThese <- mutate(addThese,
                   d15N = exp(fit),
                   lwr = exp(fit - 1.96*se.fit),
                   upr = exp(fit + 1.96*se.fit))
3). Visualize the original data points with the glm prediction smooth line added:
ggplot(addThese, aes(x1, fit)) +
  geom_point(shape = 21, size = 3) +
  geom_smooth(data = addThese,
              aes(ymin = lwr, ymax = upr),
              stat = 'identity')
I'm still wondering whether this is the right way to create new.data, but I'll give it a try. So, with your data and your code slightly modified:
myglm <- glm(y ~ x1 + x2 + x3 + x1:x2, family = gaussian, data = mydata)
minx <- min(mydata$x1)
maxx <- max(mydata$x1)
# create data with all combinations of x1, x2, x3
new.data <- expand.grid(x1 = seq(minx, maxx, length.out = 60),
                        x2 = unique(mydata$x2),
                        x3 = unique(mydata$x3))
# visualize the data (assumes dplyr and ggplot2 are loaded, e.g. via library(tidyverse))
data.frame(predict(myglm, newdata = new.data, se.fit = TRUE)[1:2]) %>%
  bind_cols(new.data) %>%
  mutate(d15N = exp(fit), lwr = fit - 1.96 * se.fit, upr = fit + 1.96 * se.fit) %>%
  ggplot(aes(x = x1, y = fit, colour = interaction(x2, x3))) +
  geom_point(size = 1, alpha = .75, pch = 19, position = "jitter") +
  geom_smooth(aes(ymin = lwr, ymax = upr), stat = "identity", alpha = .5) +
  facet_wrap(~ interaction(x2, x3, sep = " : "), nrow = 5) +
  ggthemes::theme_few() +
  labs(y = "Predicted value", x = bquote(x[1])) +
  theme(legend.position = "none")
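If you would rather have a single prediction line than one per combination, another option (a sketch, using the most common level of each factor as a hypothetical reference choice) is to hold x2 and x3 fixed when building new.x:
# hold the factors at their modal levels
new.x <- data.frame(x1 = seq(minx, maxx, length.out = 60),
                    x2 = names(which.max(table(mydata$x2))),
                    x3 = names(which.max(table(mydata$x3))))
predict(myglm, newdata = new.x, se.fit = TRUE) then works as in the question.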

Error when predicting with DirichletReg package in R

I am trying to make predictions on a test set using the DirichReg function from the DirichletReg package. When I run the model with only a few predictors it works fine, but when I use more than ~5 predictors I get an error that I can't figure out. The code below creates an MWE that reproduces the error.
library(DirichletReg)
set.seed(1)
# create dataset
predictor1 <- rnorm(n = 1000, mean = 5, sd = 1)
predictor2 <- rnorm(n = 1000, mean = 5, sd = 1)
predictor3 <- rnorm(n = 1000, mean = 5, sd = 1)
predictor4 <- rnorm(n = 1000, mean = 5, sd = 1)
predictor5 <- rnorm(n = 1000, mean = 5, sd = 1)
predictor6 <- rnorm(n = 1000, mean = 5, sd = 1)
prob_A <- runif(n = 1000, min = 0, max = 0.5)
prob_B <- runif(n = 1000, min = 0, max = 0.5)
prob_C <- 1 - prob_A - prob_B
dat <- data.frame(predictor1, predictor2, predictor3, predictor4, predictor5,
                  predictor6, prob_A, prob_B, prob_C)
# split data into training and test sets
train_vec <- sample(c(0, 1), size = nrow(dat), replace = T, prob = c(0.2, 0.8))
train_dat <- dat[train_vec == 1, ]
test_dat <- dat[train_vec == 0, ]
# run model
train_dat$prob <- DR_data(train_dat[, c('prob_A', 'prob_B', 'prob_C')])
mod <- DirichReg(prob ~ predictor1 + predictor2 + predictor3 + predictor4 +
                   predictor5 + predictor6,
                 data = train_dat, model = 'common')
# run predictions
test_dat$prob <- DR_data(test_dat[, c('prob_A', 'prob_B', 'prob_C')])
preds <- predict(object = mod, newdata = test_dat)
Here's the error that I'm getting:
Error in parse(text = x, keep.source = FALSE) :
<text>:1:74: unexpected '|'
1: prob ~ predictor1 + predictor2 + predictor3 + predictor4 + predictor5 + |
^
I would appreciate any help. I haven't been able to google the error or find it in the package documentation.
This seems to be a bug in the package. I recommend that you contact the package maintainer to report it.
A possible workaround is to explicitly list the separate parts of the regression specification instead of relying on the package to internally replicate the regressors for all parts.
mod2 <- DirichReg(prob ~
predictor1 + predictor2 + predictor3 + predictor4 + predictor5 + predictor6 |
predictor1 + predictor2 + predictor3 + predictor4 + predictor5 + predictor6 |
predictor1 + predictor2 + predictor3 + predictor4 + predictor5 + predictor6,
data = train_dat, model = "common")
all.equal(coef(mod), coef(mod2))
## [1] TRUE
predict(mod2, newdata = test_dat)
## [,1] [,2] [,3]
## [1,] 0.2436493 0.2715895 0.4847612
## [2,] 0.2541715 0.2252292 0.5205993
## [3,] 0.2618741 0.2345063 0.5036196
## ...
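If typing the predictor list three times is unwieldy, the three-part formula can also be built programmatically (a sketch; rhs simply repeats the right-hand side used above):
preds <- paste0("predictor", 1:6)
rhs <- paste(preds, collapse = " + ")
f3 <- as.formula(paste("prob ~", rhs, "|", rhs, "|", rhs))
mod3 <- DirichReg(f3, data = train_dat, model = "common")
This should match mod2, since the formula is structurally identical.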

Fitting a multivariate polynomial of generic degree in R without having to write the explicit formula

I would like to fit a multivariate polynomial of arbitrary degree, in an arbitrary number of variables, to some data. The number of variables can be high (for example 40), and the code should work for different numbers of variables (e.g., 10, 20, 40, etc.), so it's not possible to write out the formula explicitly. For a degree-1 polynomial (i.e., the classic linear model), the solution is trivial: suppose I have my data in the dataframe df; then
mymodel <- lm(y ~ ., data = df)
Unfortunately I don't know of a similar compact formula when the polynomial is of arbitrary degree. Can you help me?
This combines both options from my earlier posting (interactions and polynomial terms) in a hypothetical situation where the column names look like "X1", "X2", ..., "X30". You would take out the terms() call, which is just there to demonstrate that it was successful:
terms(as.formula(
  paste(" ~ (", paste0("X", 1:30, collapse = "+"), ")^2", "+",
        paste("poly(", paste0("X", 1:30), ", degree=2)",
              collapse = "+"),
        collapse = "")
))
You could use an expression like names(dfrm)[!names(dfrm) %in% "y"] instead of the inner paste0 calls.
Note that the interaction terms are constructed by way of the R formula process with the (...)^2 mechanism, which is not creating squared terms but rather all of the two-way interactions:
as.formula(
  paste(" ~ (", paste0("X", 1:30, collapse = "+"), ")^2", "+",
        paste("poly(", paste0("X", 1:30), ", degree=2)", collapse = "+"),
        collapse = "")
)
#----output----
~(X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10 + X11 + X12 +
X13 + X14 + X15 + X16 + X17 + X18 + X19 + X20 + X21 + X22 +
X23 + X24 + X25 + X26 + X27 + X28 + X29 + X30)^2 + poly(X1,
degree = 2) + poly(X2, degree = 2) +
poly(X3, degree = 2) +
poly(X4, degree = 2) + poly(X5, degree = 2) + poly(X6, degree = 2) +
poly(X7, degree = 2) + poly(X8, degree = 2) + poly(X9, degree = 2) +
poly(X10, degree = 2) + poly(X11, degree = 2) + poly(X12,
degree = 2) + poly(X13, degree = 2) + poly(X14, degree = 2) +
poly(X15, degree = 2) + poly(X16, degree = 2) + poly(X17,
degree = 2) + poly(X18, degree = 2) + poly(X19, degree = 2) +
poly(X20, degree = 2) + poly(X21, degree = 2) + poly(X22,
degree = 2) + poly(X23, degree = 2) + poly(X24, degree = 2) +
poly(X25, degree = 2) + poly(X26, degree = 2) + poly(X27,
degree = 2) + poly(X28, degree = 2) + poly(X29, degree = 2) +
poly(X30, degree = 2)
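To actually fit a model with the constructed formula, paste a response onto the left-hand side (a sketch; dfrm is a hypothetical data frame with columns y and X1 through X30):
form <- as.formula(
  paste("y ~ (", paste0("X", 1:30, collapse = "+"), ")^2", "+",
        paste("poly(", paste0("X", 1:30), ", degree=2)", collapse = "+")))
fit <- lm(form, data = dfrm)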
You can use this function, makepoly, which generates a formula with polynomial terms from a formula and a data frame.
makepoly <- function(form, data, degree = 1) {
  mt <- terms(form, data = data)
  tl <- attr(mt, "term.labels")
  # wrap each term in poly() and rebuild the formula with the original response
  reformulate(paste0("poly(", tl, ", ", degree, ")"),
              response = form[[2]])
}
A test data set:
set.seed(1)
df <- data.frame(y = rnorm(10),
                 x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10))
Create the formula and run the regression:
form <- makepoly(y ~ ., df, degree = 2)
# y ~ poly(x1, 2) + poly(x2, 2) + poly(x3, 2)
lm(form, df)
#
# Call:
# lm(formula = form, data = df)
#
# Coefficients:
#  (Intercept) poly(x1, 2)1 poly(x1, 2)2 poly(x2, 2)1
#       0.1322       0.1445      -5.5757      -5.2132
# poly(x2, 2)2 poly(x3, 2)1 poly(x3, 2)2
#       4.2297       0.7895       3.9796

R neuralnet does not converge within stepmax for time series

I'm writing a neural network for prediction of elements in a time series x + sin(x^2) in R, using the neuralnet package. This is how training data is being generated, assuming a window of 4 elements, and that the last one is the one that has to be predicted:
nntr0 <- ((1:25) + sin((1:25)^2))
nntr1 <- ((2:26) + sin((2:26)^2))
nntr2 <- ((3:27) + sin((3:27)^2))
nntr3 <- ((4:28) + sin((4:28)^2))
nntr4 <- ((5:29) + sin((5:29)^2))
Then, I turn these into a data.frame:
nntr <- data.frame(nntr0, nntr1, nntr2, nntr3, nntr4)
Then, I proceed to train the NN:
net.sinp <- neuralnet(nntr4 ~ nntr0 + nntr1 + nntr2 + nntr3, data = nntr,
                      hidden = 10, threshold = 0.04, act.fct = "tanh",
                      linear.output = TRUE, stepmax = 100000)
Which, after a while, gives me the message
Warning message:
algorithm did not converge in 1 of 1 repetition(s) within the stepmax
Call: neuralnet(formula = nntr4 ~ nntr0 + nntr1 + nntr2 + nntr3, data = nntr, hidden = 10, threshold = 0.04, stepmax = 100000, act.fct = "tanh", linear.output = TRUE)
Can anyone help me figure out why it is not converging? Many thanks
With tanh as the activation function (it is bounded), it is very difficult to reproduce the linear trend in your signal. You can use linear activation functions instead, or try to detrend the signal.
# Data
dx <- 1
n <- 25
x <- seq(0,by=dx,length=n+4)
y <- x + sin(x^2)
y0 <- y[1:n]
y1 <- y[1 + 1:n]
y2 <- y[2 + 1:n]
y3 <- y[3 + 1:n]
y4 <- y[4 + 1:n]
d <- data.frame(y0, y1, y2, y3, y4)
library(neuralnet)
# Linear activation functions
r <- neuralnet(y4 ~ y0 + y1 + y2 + y3, data=d, hidden=10)
plot(y4, compute(r, d[,-5])$net.result)
# No trend
d2 <- data.frame(
y0 = y0 - x[1:n],
y1 = y1 - x[1 + 1:n],
y2 = y2 - x[2 + 1:n],
y3 = y3 - x[3 + 1:n],
y4 = y4 - x[4 + 1:n]
)
r <- neuralnet(y4 ~ y0 + y1 + y2 + y3, data=d2, hidden=10, act.fct="tanh" )
plot(d2$y4, compute(r, d2[,-5])$net.result)
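Note that after detrending you need to add the trend back to compare predictions on the original scale (a short sketch using the objects defined above):
# predictions on the detrended scale, then add the linear trend back
pred_detrended <- compute(r, d2[, -5])$net.result
pred_original <- pred_detrended + x[4 + 1:n]
plot(y4, pred_original); abline(0, 1)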
The warning message
algorithm did not converge in 1 of 1 repetition(s) within the stepmax
means the algorithm reached the maximum number of steps before it converged. If you type ?neuralnet and look at the definition of stepmax, it says:
the maximum steps for the training of the neural network. Reaching this maximum leads to a stop of the neural network's training process.
For your problem, I recommend increasing the stepmax value to 1e7 and seeing what happens. The code will be:
net.sinp <- neuralnet(nntr4 ~ nntr0 + nntr1 + nntr2 + nntr3, data = nntr,
                      hidden = 10, threshold = 0.04, act.fct = "tanh",
                      linear.output = TRUE, stepmax = 1e7)
