How can I speed up this loop in R?

set.seed(155656494)
#setting parameter values
n<-500
sdu<-25
beta0<-40
beta1<-12
# Running the simulation again
# create the x variable outside the loop since it’s fixed in
# repeated sampling
x2 <- floor(runif(n,5,16))
# set the number of iterations for your simulation (how many values
# of beta1 will be estimated)
nsim2 <- 10000000
# create a vector to store the estimated values of beta1
vbeta2 <- numeric(nsim2)
# create a loop that produces values of y, regresses y on x, and
# stores the OLS estimate of beta1
for (i in 1:nsim2) {
y2 <- beta0 + beta1*x2 + 0.2*x2 + rnorm(n,mean=0,sd=sdu)
model2 <- lm(y2 ~ x2)
vbeta2[i] <- coef(model2)[[2]]
}
mean(vbeta2)
The above is a simple linear regression simulation with 10 million iterations. I'm looking for help with speeding up the loop. Each iteration basically runs y2 <- beta0 + beta1*x2 + 0.2*x2 + rnorm(n, mean = 0, sd = sdu), regresses y2 on x2, and stores the slope estimate, which is then used to calculate the mean of vbeta2.

You could use profvis to determine where processor time is spent:
library(profvis)
profvis({
set.seed(155656494)
#setting parameter values
n<-500
sdu<-25
beta0<-40
beta1<-12
# Running the simulation again
# create the x variable outside the loop since it’s fixed in
# repeated sampling
x2 <- floor(runif(n,5,16))
# set the number of iterations for your simulation (how many values
# of beta1 will be estimated)
nsim2 <- 1000 # reduced from 10 million so the profiled run finishes quickly
# create a vector to store the estimated values of beta1
vbeta2 <- numeric(nsim2)
# create a loop that produces values of y, regresses y on x, and
# stores the OLS estimate of beta1
for (i in 1:nsim2) {
y2 <- beta0 + beta1*x2 + 0.2*x2 + rnorm(n,mean=0,sd=sdu)
model2 <- lm(y2 ~ x2)
vbeta2[i] <- coef(model2)[[2]]
}
mean(vbeta2)
})
The result shows that most of the time is spent evaluating the linear regression model.
As suggested by @maarvd, you could parallelize to speed this up. However, parallelizing each single calculation won't be efficient, because one calculation is too fast (~0.5 ms), so you'll have to distribute chunks of many thousand calculations per worker, as in the sketch below. I agree with @Allan Cameron, though: is this worth the effort?
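A minimal sketch of that chunking idea, using the parallel package that ships with base R (the worker count and chunk size here are illustrative, not tuned):
library(parallel)

n_workers  <- 4
chunk_size <- 10000                 # thousands of fits per worker task
n_chunks   <- nsim2 / chunk_size

cl <- makeCluster(n_workers)
clusterSetRNGStream(cl, 155656494)  # reproducible parallel RNG streams
clusterExport(cl, c("n", "sdu", "beta0", "beta1", "x2", "chunk_size"))

res <- parLapply(cl, seq_len(n_chunks), function(chunk) {
  vapply(seq_len(chunk_size), function(i) {
    y2 <- beta0 + beta1 * x2 + 0.2 * x2 + rnorm(n, mean = 0, sd = sdu)
    coef(lm(y2 ~ x2))[[2]]
  }, numeric(1))
})
stopCluster(cl)

vbeta2 <- unlist(res)
mean(vbeta2)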

Related

Running random error model with mgcv gam takes too much memory

I am working on a model that includes several REs and a spline for one of the variables, so I am trying to use gam(). However, I hit a vector-memory-exhausted error (even when I run it on a cluster with 128 GB). This happens even when I run the simplest of models with just one RE. The same models (minus the spline) run smoothly, in just a few seconds (or minutes for the full model), when I use lmer() instead.
I was wondering if anyone had any idea why the discrepancy between gam() and lmer() and any potential solutions.
Here's some code with simulated data and the simplest of models:
library(mgcv)
library(lme4)
set.seed(1234)
person_n <- 38000 # number of people (grouping variable)
n_j <- 15 # number of data points per person
B1 <- 3 # beta for the main predictor
n <- person_n * n_j
person_id <- gl(person_n, k = n_j) #creating the grouping variable
person_RE <- rep(rnorm(person_n), each = n_j) # creating the random effects
x <- rnorm(n) # creating x as a normal dist centered at 0 and sd = 1
error <- rnorm(n)
#putting it all together
y <- B1 * x + person_RE + error
dat <- data.frame(y, person_id, x)
m1 <- lmer(y ~ x + (1 | person_id), data = dat)
g1 <- gam(y ~ x + s(person_id, bs = "re"), method = "REML", data = dat)
m1 runs in just a couple of seconds on my computer, whereas g1 hits the error:
Error: vector memory exhausted (limit reached?)
From ?mgcv::random.effects:
gam can be slow for fitting models with large numbers of random
effects, because it does not exploit the sparsity that is often a
feature of parametric random effects ... However ‘gam’ is often
faster and more reliable than ‘gamm’ or ‘gamm4’, when the number
of random effects is modest. [emphasis added]
What this means is that in the course of setting up the model, s(., bs = "re") tries to generate a dense model matrix equivalent to model.matrix( ~ person_id - 1); this takes (nrows x nlevels x 8 bytes/double) = (5.7e5*3.8e4*8)/2^30 = 161.4 GiB (which is exactly the object size that my machine reports it can't allocate).
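You can check that arithmetic directly in R:
n_rows <- 38000 * 15          # observations (person_n * n_j)
n_cols <- 38000               # levels of person_id
n_rows * n_cols * 8 / 2^30    # ~161.4 GiB of doubles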
Check out mgcv::gamm and gamm4::gamm4 for more memory-efficient (and faster, in this case) methods ...
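For example, a sketch of both alternatives (assuming the gamm4 package is installed; the formulas mirror m1/g1 above):
library(gamm4)
# lme4-based; exploits the sparsity of the random-effects model matrix
g2 <- gamm4(y ~ x, random = ~ (1 | person_id), data = dat)
summary(g2$gam)  # smooth/fixed-effects side of the fit
summary(g2$mer)  # underlying lme4 fit

# nlme-based alternative (may still be slow with this many groups)
g3 <- mgcv::gamm(y ~ x, random = list(person_id = ~ 1), data = dat)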

Different set.seed each run in R

I want to "measure" which regression method is more robust to outliers.
To do this, I sum the variances of the model coefficients. On each run I generate data from a t-distribution, and I call set.seed ten times to get ten specific data sets.
However, I also want ten different seeds on each run, so that in total I get 10 sums of the variances. The code below only gives me one sum, for the first set of ten seeds.
How can I do this?
#######################################
p <- 5
n <- 50
#######################################
FX <- function(seed, data) {
#for loops over a seed #
for (i in seed) {
set.seed(seed)
# generating data from t-distribution #
x<- matrix(rt(n*p,1), ncol = p)
y<-rt(n,1)
dat=cbind(x,y)
data<-as.data.frame(dat)
# performing a regression model on the data #
lm1 <- lm(y ~ ., data=data)
lm.coefs <- coef(lm1)
lad1 <- lad(y ~ ., data=data, method="BR")
lad.coefs <- coef(lad1)
}
# calculate variance of the coefficients #
return(`attr<-`(cbind(lmm=var(lm.coefs), lad=var(lad.coefs)), "seed", seed))
}
#######################################
seeds <- 1:10 ## 10 seeds to get different data sets from the t-distribution #
res <- lapply(seeds, FX, data=data) # 10 different variances from 10 data sets/models
sov <- t(sapply(res, colSums)) # put them in a matrix
colSums(sov) # sum of the 10 variances for each model
Here is something closer to your intended results.
The code below fixes a few key issues in your original code (it was not clear what data the function was intended to return):
1. It creates the vector of seed numbers inside the function.
2. It also creates a vector inside the function to store the variance of the coefficients for each iteration of the loop (not sure if this is what you want).
3. I needed to comment out the lad() call, since I do not know which package it comes from; you would need to follow point 2 above to add it back in.
4. Some general clean-up of the code.
p <- 5
n <- 50
FX <- function(seed, data) {
#for loops over a seed #
#Fixes the starting seed issue
startingSeed <- (seed-1)*10 +1
seeds <- seq( startingSeed, startingSeed+9)
#create vector to store results from loop iteration
lm.coefs <- vector(mode="numeric", length=10)
index <- 1
for (i in seeds) {
set.seed(i)
# generating data from t-distribution #
x<- matrix(rt(n*p,1), ncol = p)
y<-rt(n,1)
data<-data.frame(x, y)
# performing a regression model on the data #
lm1 <- lm(y ~ ., data=data)
lm.coefs[index] <- var(coef(lm1))
# lad1 <- lad(y ~ ., data=data, method="BR")
# lad.coefs <- coef(lad1)
index <- index +1
}
# calculate variance of the coefficients #
return(`attr<-`(cbind(lmm=lm.coefs), "seed", seed))
}
seeds <- 1:10 ## 10 seeds to get different data sets from the t-distribution #
res <- lapply(seeds, FX, data=data) # 10 different variances from 10 data sets/models
sov <- t(sapply(res, colSums)) # put them in a matrix
colSums(sov) # sum of the 10 variances for each model
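For what it's worth, here is a more compact sketch of the same idea without the explicit index bookkeeping (lad() still omitted, as above; n and p as defined earlier):
one_run <- function(seed) {
  set.seed(seed)
  x <- matrix(rt(n * p, df = 1), ncol = p)
  y <- rt(n, df = 1)
  var(coef(lm(y ~ ., data = data.frame(x, y))))
}
vars <- vapply(1:10, one_run, numeric(1))
sum(vars)  # sum of the 10 coefficient variances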
Hope this provides the answer or at least guidance to solve your problem.

Random forest variable importance AND direction of correlation for binomial response

I am using the randomForest package in R, but I am open to solutions using other packages.
My RF model uses various continuous and categorical variables to predict extinction risk (Threatened, Non_Threatened). I would like to be able to show the direction of variable importance for the predictors used in my RF model. Other publications have done exactly this: see Figure 1 in https://www.pnas.org/content/pnas/109/9/3395.full.pdf
Any ideas on how to do something similar? One suggestion I read said to simply compare the difference between two partial dependence plots (example below), but I feel this may not be the best way.
Any help would be greatly appreciated.
partialPlot(final_rf, rf_train, size_mat,"Threatened")
partialPlot(final_rf, rf_train, size_mat,"Non_Threatened")
[Partial dependence plots shown for response = Threatened and response = Non_Threatened]
You could use something like an average marginal effect (or like below, an average first difference) approach.
First, I'll make some data
set.seed(11)
n = 200
p = 5
X = data.frame(matrix(runif(n * p), ncol = p))
yhat = 10 * sin(pi * X[,1] * X[,2]) + 20 * (X[,3] - 0.5)^2 + 10 * -X[,4] + 5 * -X[,5]
y = as.numeric((yhat+ rnorm(n)) > mean(yhat))
df <- as.data.frame(cbind(X,y))
Next, we'll estimate the RF model:
library(randomForest)
rf <- randomForest(as.factor(y) ~ ., data=df)
Next, we can loop through each variable; each time through the loop, we add one standard deviation to a single x variable for all observations. In your application, for categorical variables, you could instead change each observation from one category to another. We then predict the probability of a positive response under both conditions: the original data, and the data with one standard deviation added to the variable in question. Finally, we take the difference and summarize it.
nx <- names(df)
nx <- nx[-which(nx == "y")]
# baseline predictions are identical every iteration, so compute them once
p1 <- predict(rf, newdata = df, type = "prob")
res <- NULL
for (i in seq_along(nx)) {
  df2 <- df
  df2[[nx[i]]] <- df2[[nx[i]]] + sd(df2[[nx[i]]])  # shift one variable by +1 SD
  p2 <- predict(rf, newdata = df2, type = "prob")
  diff <- (p2 - p1)[, 2]                           # change in P(y = 1)
  res <- rbind(res, c(mean(diff), sd(diff)))
}
colnames(res) <- c("effect", "sd")
rownames(res) <- nx
res
# effect sd
# X1 0.11079 0.18491252
# X2 0.10265 0.16552070
# X3 0.02015 0.07951409
# X4 -0.11687 0.16671916
# X5 -0.04704 0.10274836
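If you want to present these as a signed variable-importance figure, here is a quick base-graphics sketch (labels are illustrative):
ord <- order(res[, "effect"])
barplot(res[ord, "effect"], names.arg = rownames(res)[ord],
        horiz = TRUE, las = 1,
        xlab = "Average first difference in P(y = 1) for a +1 SD shift",
        main = "Direction and magnitude of predictor effects")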

Individual terms in prediction of linear regression

I performed a regression analysis in R on some dataset and am trying to predict the contribution of each individual independent variable to the dependent variable for each row in the dataset.
So something like this:
set.seed(123)
y <- rnorm(10)
m <- data.frame(v1=rnorm(10), v2=rnorm(10), v3=rnorm(10))
regr <- lm(formula=y~v1+v2+v3, data=m)
summary(regr)
terms <- predict.lm(regr,m, type="terms")
In short: run a regression and use the predict function to calculate the terms of v1, v2 and v3 in dataset m. But I am having a hard time understanding what the predict function is calculating. I would expect it to multiply the regression coefficient by the variable's data. So, something like this for v1:
coefficients(regr)[2]*m$v1
But that gives different results compared to the predict function.
Own calculation:
0.55293884 0.16253411 0.18103537 0.04999729 -0.25108302 0.80717945 0.22488764 -0.88835486 0.31681455 -0.21356803
And predict function calculation:
0.45870070 0.06829597 0.08679724 -0.04424084 -0.34532115 0.71294132 0.13064950 -0.98259299 0.22257641 -0.30780616
The predict function is off by 0.1 or so. Also, if you add all the terms from the predict function together with the constant, it doesn't add up to the total prediction (using type="response"). What does the predict function calculate here, and how can I tell it to calculate what I did with coefficients(regr)[2]*m$v1?
All the following lines result in the same predictions:
# our computed predictions
coefficients(regr)[1] + coefficients(regr)[2]*m$v1 +
coefficients(regr)[3]*m$v2 + coefficients(regr)[4]*m$v3
# prediction using predict function
predict.lm(regr,m)
# prediction using terms matrix, note that we have to add the constant.
terms_predict = predict.lm(regr,m, type="terms")
terms_predict[,1]+terms_predict[,2]+terms_predict[,3]+attr(terms_predict,'constant')
You can read more about using type="terms" in the documentation (?predict.lm).
The reason that your own calculation (coefficients(regr)[2]*m$v1) and the predict function's calculation (terms_predict[,1]) differ is that the columns in the terms matrix are centered by their means, so each column's mean becomes zero:
# this is equal to terms_predict[,1]
coefficients(regr)[2]*m$v1-mean(coefficients(regr)[2]*m$v1)
# indeed, all columns are centered; i.e. have a mean of 0.
round(sapply(as.data.frame(terms_predict),mean),10)
Hope this helps.
The function predict(..., type="terms") centers each variable by its mean. As a result, the output is a little difficult to interpret. Here's an alternative where each variable (constant, x1, and x2) is multiplied by its coefficient.
TLDR: pred_terms <- model.matrix(formula(mod$terms), testData) %*% diag(coef(mod))
library(tidyverse)
### simulate data
set.seed(123)
nobs <- 50
x1 <- cumsum(rnorm(nobs) + 3)
x2 <- cumsum(rnorm(nobs) * 3)
y <- 2 + 2*x1 -0.5*x2 + rnorm(nobs,0,50)
df <- data.frame(t=1:nobs, y=y, x1=x1, x2=x2)
train <- 1:round(0.7*nobs,0)
rm(x1, x2, y)
trainData <- df[train,]
testData <- df[-train,]
### linear model
mod <- lm(y ~ x1 + x2 , data=trainData)
summary(mod)
### predict test set
test_preds <- predict(mod, newdata=testData)
head(test_preds)
### contribution by predictor
test_contribution <- model.matrix(formula(mod$terms), testData) %*% diag(coef(mod))
colnames(test_contribution) <- names(coef(mod))
head(test_contribution)
all(round(apply(test_contribution, 1, sum),5) == round(test_preds,5)) ## should be true
### Visualize each contribution
test_contribution_df <- as.data.frame(test_contribution)
test_contribution_df$pred <- test_preds
test_contribution_df$t <- row.names(test_contribution_df)
test_contribution_df$actual <- df[-train,"y"]
test_contribution_df_long <- pivot_longer(test_contribution_df, -t, names_to="variable")
names(test_contribution_df_long)
ggplot(test_contribution_df_long, aes(x=t, y=value, group=variable, color=variable)) +
geom_line() +
theme_bw()

random effects variance of intercept being zero

I am running a power analysis using a normal LMM in R. I have seven input parameters, two of which I do not need to test for (no. of years and no. of sites). The other five parameters are the intercept, the slope, and the random-effects standard deviations of the residual, the intercept, and the slope.
Given that my response data (year is the sole explanatory variable in the model) are bounded between -1 and +1, the intercept also falls in this range. What I am finding is that if I run, say, 1000 simulations with a given intercept and slope (which I am treating as constant over the 10 years), then whenever the random-intercept SD falls below a certain value, there are many simulations in which the estimated random-intercept SD comes out as zero. If I inflate the intercept SD, it is recovered correctly (see below, where I use residual SD = 0.25, intercept SD = 0.10 and slope SD = 0.05: if I increase the intercept SD to 0.2, it is recovered correctly; likewise, if I drop the residual SD to, say, 0.05, the variance parameters are recovered correctly).
Is this problem due to my coercion of the range to be (-1, +1)?
I include the code for my function and the processing of the simulations below, if this would help:
Function: generating the data:
normaldata <- function (J, K, beta0, beta1, sigma_resid,
sigma_beta0, sigma_beta1){
year <- rep(rep(0:J),K) # 0:J replicated K times
site <- rep (1:K, each=(J+1)) # 1:K sites, repeated J years
mu.beta0_true <- beta0
mu.beta1_true <- beta1
# random effects variance parameters:
sigma_resid_true <- sigma_resid
sigma_beta0_true <- sigma_beta0
sigma_beta1_true <- sigma_beta1
# site-level parameters:
beta0_true <<- rnorm(K, mu.beta0_true, sigma_beta0_true)
beta1_true <<- rnorm(K, mu.beta1_true, sigma_beta1_true)
# data
y <<- rnorm(n = (J+1)*K, mean = beta0_true[site] + beta1_true[site]*(year),
sd = sigma_resid_true)
# NOT SURE WHETHER TO IMPOSE THE LIMITS HERE OR LATER IN CODE:
y[y < -1] <- -1 # Absolute minimum
y[y > 1] <- 1 # Absolute maximum
return(data.frame(y, year, site))
}
Processing the simulations:
library(lme4)
library(arm) # for se.fixef()
n.sims = 1000
sigma.resid <- rep(0, n.sims)
sigma.intercept <- rep(0, n.sims)
sigma.slope <- rep(0,n.sims)
intercept <- rep(0,n.sims)
slope <- rep(0,n.sims)
signif <- rep(0,n.sims)
for (s in 1:n.sims){
y.data <- normaldata(10,200, 0.30, ((0-0.30)/10), 0.25, 0.1, 0.05)
lme.power <- lmer(y ~ year + (1+year | site), data=y.data)
summary(lme.power)
theta.hat <- fixef(lme.power)[["year"]]
theta.se <- se.fixef(lme.power)[["year"]]
signif[s] <- ((theta.hat + 1.96*theta.se) < 0) |
((theta.hat - 1.96*theta.se) > 0) # returns TRUE or FALSE
signif[s]
betas <- fixef(lme.power)
intercept[s] <- betas[1]
slope[s] <- betas[2]
vc1 <- as.data.frame(VarCorr(lme.power))
vc2 <- as.numeric(attributes(VarCorr(lme.power)$site)$stddev)
sigma.resid[s] <- vc1[4,5]
sigma.intercept[s] <- vc2[1]
sigma.slope[s] <- vc2[2]
cat(paste(s, " ")); flush.console()
}
power <- mean (signif) # proportion of TRUE
power
summary(sigma.resid)
summary(sigma.intercept)
summary(sigma.slope)
summary(intercept)
summary(slope)
Thank you in advance for any help you can offer.
This is really more of a statistical than a computational question, but the short answer is: you haven't made any mistakes; this is exactly as expected. This example on rpubs runs some simulations of a normally distributed response (i.e. it corresponds exactly to the model assumed by LMM software, so the constraint you're worried about isn't an issue).
[Histograms omitted: the left-hand one is from simulations with 25 samples in 5 groups, with equal variance (of 1) within and between groups; the right-hand one is from simulations with 15 samples in 3 groups.]
The sampling distribution of variance estimates for null cases (i.e., no real between-group variation) is known to have a point mass or "spike" at zero; it's not surprising (although, as far as I know, not theoretically worked out) that the sampling distribution of the variance estimates also has a point mass at zero when the between-group variance is non-zero but small, and/or when the sample is small and/or noisy.
http://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#zero-variance has more on this topic.
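If you want to see the effect in miniature, here is a sketch along the lines of the rpubs simulations (25 observations in 5 groups, with the between-group SD equal to the residual SD; assumes lme4; expect many "boundary (singular) fit" messages):
library(lme4)
set.seed(101)
sd_hat <- replicate(500, {
  g <- gl(5, 5)                   # 5 groups, 5 observations each
  y <- rnorm(5)[g] + rnorm(25)    # between-group SD = residual SD = 1
  fit <- lmer(y ~ 1 + (1 | g))
  attr(VarCorr(fit)$g, "stddev")  # estimated between-group SD
})
mean(sd_hat < 1e-6)  # proportion of exact-zero variance estimates
hist(sd_hat, main = "Sampling distribution of the between-group SD")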
