Different set.seed each run in R

I want to "measure" which regression method is more robust to outliers. To do this, I sum the variances of the model coefficients. On each run I generate data from a t-distribution, calling set.seed ten times so that I get ten specific data sets.
However, I also want ten different seeds on each run, so that in total I end up with 10 sums of variances. The code below only gives me the one sum from the first ten seeds.
How can I do this?
#######################################
p <- 5
n <- 50
#######################################
FX <- function(seed, data) {
  # loop over the seeds #
  for (i in seed) {
    set.seed(seed)
    # generate data from a t-distribution #
    x <- matrix(rt(n * p, 1), ncol = p)
    y <- rt(n, 1)
    dat <- cbind(x, y)
    data <- as.data.frame(dat)
    # fit the regression models on the data #
    lm1 <- lm(y ~ ., data = data)
    lm.coefs <- coef(lm1)
    lad1 <- lad(y ~ ., data = data, method = "BR")
    lad.coefs <- coef(lad1)
  }
  # calculate the variance of the coefficients #
  return(`attr<-`(cbind(lmm = var(lm.coefs), lad = var(lad.coefs)), "seed", seed))
}
#######################################
seeds <- 1:10                       # 10 seeds to get different data sets from the t-distribution
res <- lapply(seeds, FX, data=data) # 10 different variances from 10 data sets/models
sov <- t(sapply(res, colSums))      # put them in a matrix
colSums(sov)                        # sum of the 10 variances for each model

Here is something closer to your intended result. The code below fixes a few key issues in your original code; it was not entirely clear what the function was meant to return.
1. It creates a vector of seed numbers inside the function.
2. It also creates a vector inside the function to store the variance of the coefficients from each iteration of the loop (not sure if that is what you want).
3. I had to comment out the lad() call since I do not know which package it comes from (you would need to follow point 2 above to add it back in).
4. Some general cleanup of the code.
p <- 5
n <- 50
FX <- function(seed, data) {
  # fix the starting seed so each run uses its own block of 10 seeds
  startingSeed <- (seed - 1) * 10 + 1
  seeds <- seq(startingSeed, startingSeed + 9)
  # create a vector to store the results from each loop iteration
  lm.coefs <- vector(mode = "numeric", length = 10)
  index <- 1
  for (i in seeds) {
    set.seed(i)
    # generate data from a t-distribution #
    x <- matrix(rt(n * p, 1), ncol = p)
    y <- rt(n, 1)
    data <- data.frame(x, y)
    # fit the regression model on the data #
    lm1 <- lm(y ~ ., data = data)
    lm.coefs[index] <- var(coef(lm1))
    # lad1 <- lad(y ~ ., data = data, method = "BR")
    # lad.coefs <- coef(lad1)
    index <- index + 1
  }
  # return the variances of the coefficients, tagged with the run's seed #
  return(`attr<-`(cbind(lmm = lm.coefs), "seed", seed))
}
seeds <- 1:10                       # 10 seeds, one per run, each giving different t-distributed data
res <- lapply(seeds, FX, data=data) # 10 different variances from 10 data sets/models
sov <- t(sapply(res, colSums))      # put them in a matrix
colSums(sov)                        # sum of the 10 variances for each model
Hope this provides the answer or at least guidance to solve your problem.
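For reference, here is a compact sketch of the same idea with the lad model added back in. It assumes lad() comes from the L1pack package (method = "BR"), which may not match your setup, so swap in whichever package you actually use.

# Sketch, assuming lad() is L1pack::lad with method = "BR"
library(L1pack)

p <- 5
n <- 50

one_run <- function(seed) {
  set.seed(seed)
  x <- matrix(rt(n * p, df = 1), ncol = p)
  y <- rt(n, df = 1)
  dat <- data.frame(x, y)
  c(lm  = var(coef(lm(y ~ ., data = dat))),
    lad = var(coef(lad(y ~ ., data = dat, method = "BR"))))
}

run_block <- function(run) {
  seeds <- (run - 1) * 10 + 1:10    # each run gets its own block of 10 seeds
  rowSums(sapply(seeds, one_run))   # sum of the 10 variances per method
}

sov <- t(sapply(1:10, run_block))   # 10 x 2 matrix of summed variances
sov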

Related

Two methods of recovering fitted values from a Bayesian Structural Time Series model yield different results

Two conceptually plausible methods of retrieving in-sample predictions (or "conditional expectations") of y[t] given y[t-1] from a bsts model yield different results, and I don't understand why.
One method uses the prediction errors returned by bsts (defined as e=y[t] - E(y[t]|y[t-1]); source: https://rdrr.io/cran/bsts/man/one.step.prediction.errors.html):
library(bsts)
get_yhats1 <- function(fit) {
  # One-step prediction errors defined as e = y[t] - yhat (source: )
  # Recover yhat as y - e
  bsts.pred.errors <- bsts.prediction.errors(fit, burn = SuggestBurn(0.1, fit))$in.sample
  predictions <- t(apply(bsts.pred.errors, 1, function(e) fit$original.series - e))
  return(predictions)
}
Another method sums the contributions of all model components at time t.
get_yhats2 <- function(fit) {
  burn <- SuggestBurn(0.1, fit)
  X <- fit$state.contributions
  niter <- dim(X)[1]
  ncomp <- dim(X)[2]
  nobs  <- dim(X)[3]
  # initialize the final fit/residuals matrices with zeros
  predictions <- matrix(data = 0, nrow = niter - burn, ncol = nobs)
  p0 <- predictions
  comps <- seq_len(ncomp)
  for (comp in comps) {
    # pull out the state contributions for this component:
    # a (niter - burn) x nobs array
    compX <- X[-seq_len(burn), comp, ]
    # accumulate the predictions across the components
    predictions <- predictions + compX
  }
  return(predictions)
}
Fit a model:
## Air passengers data
data("AirPassengers")
# 11 years, monthly data (timestep=monthly) --> 132 observations
Y <- stats::window(AirPassengers, start=c(1949,1), end=c(1959,12))
y <- log(Y)
ss <- AddLocalLinearTrend(list(), y)
ss <- AddSeasonal(ss, y, nseasons=12, season.duration=1)
bsts.model <- bsts(y, state.specification=ss, niter=500, family='gaussian')
Compute and compare the predictions using each of the functions:
p1 <- get_yhats1(bsts.model)
p2 <- get_yhats2(bsts.model)
# Compare predictions for t=1:5, first MCMC iteration:
p1[1,1:5]; p2[1,1:5]
I'm the author of bsts.
The 'prediction errors' in bsts come from the filtering distribution. That is, they come from p(state | past data). The state contributions come from the smoothing distribution, i.e. p(state | all data). The filtering distribution looks backward in time, while the smoothing distribution looks both forward and backward. One typically needs the filtering distribution while using a fitted model, and the smoothing distribution while fitting the model in the first place.
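One way to see this difference numerically (a rough sketch, assuming the objects y, p1 and p2 from the question are still in scope) is to compare how closely each set of posterior-mean fits tracks the observed series; the smoothed fit conditions on all of the data, so it will typically sit closer to it.

yhat1 <- colMeans(p1)                 # filtered one-step-ahead fits
yhat2 <- colMeans(p2)                 # smoothed state contributions
rmse <- function(a, b) sqrt(mean((a - b)^2))
rmse(as.numeric(y), yhat1)            # typically the larger of the two
rmse(as.numeric(y), yhat2)            # typically the smaller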

Creating a for loop to calculate AIC scores for different models using lm

I'm trying to calculate AIC scores for several different models in a for loop.
I have created a for loop that computes the log-likelihood for each model. However, I am stuck on writing the lm() call so that it fits a model for each combination of my column LOGABUNDANCE with columns 4 to 11 of my data frame.
This is the code I have used so far, but it gives me the same AIC score for every model.
# AIC score for every model
LL <- rep(NA, 10)
AIC <- rep(NA, 10)
for (i in 1:10) {
  mod <- lm(LOGABUNDANCE ~ ., data = butterfly)
  sigma <- as.numeric(summary(mod)[6])
  LL[i] <- sum(log(dnorm(butterfly$LOGABUNDANCE, predict(mod), sigma)))
  AIC[i] <- -2*LL[i] + 2*(2)
}
You get the same AIC for every model because you fit the same model 10 times.
To make the code work, you need some way of changing the model in each iteration. I can see two options:
1. Subset the data at the start of each iteration so it only contains LOGABUNDANCE and one other variable (as suggested by @yacine-hajji in the comments), or
2. Create a vector of the variables you want to model with, and use as.formula() together with paste0() to build a new formula for each iteration.
I think solution 2 is easier. Here is a working example of solution 2, using mtcars:
# AIC score for every model
LL <- rep(NA, 10)
AIC <- rep(NA, 10)
# Say I want to model all variables against `mpg`:
# create a vector of all variable names except mpg
variables <- names(mtcars)[-1]
for (i in 1:10) {
  # note how the formula is different in each iteration
  mod <- lm(
    as.formula(paste0("mpg ~ ", variables[i])),
    data = mtcars
  )
  sigma <- as.numeric(summary(mod)[6])
  LL[i] <- sum(log(dnorm(mtcars$mpg, predict(mod), sigma)))
  AIC[i] <- -2*LL[i] + 2*(2)
}
Output:
AIC
#> [1] 167.3716 168.2746 179.3039 188.8652 164.0947 202.6534 190.2124 194.5496
#> [9] 200.4291 197.2459
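To map this back to your own data, something along these lines should work. This is only a sketch: the butterfly column layout is assumed from your description, and dnorm(..., log = TRUE) is just a numerically safer way of writing sum(log(dnorm(...))).

# Sketch: one model per predictor in columns 4 to 11 of `butterfly` (assumed layout)
variables <- names(butterfly)[4:11]
LL  <- rep(NA, length(variables))
AIC <- rep(NA, length(variables))
for (i in seq_along(variables)) {
  mod <- lm(as.formula(paste0("LOGABUNDANCE ~ ", variables[i])), data = butterfly)
  sigma  <- summary(mod)$sigma
  LL[i]  <- sum(dnorm(butterfly$LOGABUNDANCE, predict(mod), sigma, log = TRUE))
  AIC[i] <- -2 * LL[i] + 2 * 2
}
AIC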

Random forest variable importance AND direction of correlation for binomial response

I am using the randomForest package in R, but am not partial to solutions using other packages.
My RF model uses various continuous and categorical variables to predict extinction risk (Threatened, Non_Threatened). I would like to be able to show the direction of variable importance for the predictors used in my RF model. Other publications have done exactly this: see Figure 1 in https://www.pnas.org/content/pnas/109/9/3395.full.pdf
Any ideas on how to do something similar? One suggestion I read was to simply compare the difference between two partial dependence plots (example below), but I feel this may not be the best way.
Any help would be greatly appreciated.
partialPlot(final_rf, rf_train, size_mat, "Threatened")
partialPlot(final_rf, rf_train, size_mat, "Non_Threatened")
[Partial dependence plots for response = Threatened and response = Non_Threatened]
You could use something like an average marginal effect (or, as below, an average first difference) approach.
First, I'll make some data:
set.seed(11)
n <- 200
p <- 5
X <- data.frame(matrix(runif(n * p), ncol = p))
yhat <- 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - 0.5)^2 +
  10 * -X[, 4] + 5 * -X[, 5]
y <- as.numeric((yhat + rnorm(n)) > mean(yhat))
df <- as.data.frame(cbind(X, y))
Next, we'll estimate the RF model:
library(randomForest)
rf <- randomForest(as.factor(y) ~ ., data=df)
Next, we can loop through the variables. Each time through the loop, we add one standard deviation to a single x variable for all observations (in your case, for categorical variables, you could instead switch observations from one category to another). We then predict the probability of a positive response under both conditions, the original data and the data with one standard deviation added to that variable, take the difference, and summarize it.
nx <- names(df)
nx <- nx[-which(nx == "y")]
res <- NULL
for (i in seq_along(nx)) {
  p1 <- predict(rf, newdata = df, type = "prob")
  df2 <- df
  df2[[nx[i]]] <- df2[[nx[i]]] + sd(df2[[nx[i]]])
  p2 <- predict(rf, newdata = df2, type = "prob")
  diff <- (p2 - p1)[, 2]
  res <- rbind(res, c(mean(diff), sd(diff)))
}
colnames(res) <- c("effect", "sd")
rownames(res) <- nx
res
#      effect         sd
# X1  0.11079 0.18491252
# X2  0.10265 0.16552070
# X3  0.02015 0.07951409
# X4 -0.11687 0.16671916
# X5 -0.04704 0.10274836
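If the goal is a figure in the spirit of the PNAS example, these average first differences can be plotted directly as a signed importance measure. A purely illustrative sketch using base graphics:

# Illustrative sketch: plot the average first differences as signed importances
res_df <- data.frame(variable = rownames(res), res)
res_df <- res_df[order(res_df$effect), ]
barplot(res_df$effect, names.arg = res_df$variable, horiz = TRUE, las = 1,
        xlab = "Average first difference in P(positive response)")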

How can I speed up this loop in R?

set.seed(155656494)
# setting parameter values
n <- 500
sdu <- 25
beta0 <- 40
beta1 <- 12
# Running the simulation again
# create the x variable outside the loop since it's fixed in
# repeated sampling
x2 <- floor(runif(n, 5, 16))
# set the number of iterations for your simulation (how many values
# of beta1 will be estimated)
nsim2 <- 10000000
# create a vector to store the estimated values of beta1
vbeta2 <- numeric(nsim2)
# create a loop that produces values of y, regresses y on x, and
# stores the OLS estimate of beta1
for (i in 1:nsim2) {
  y2 <- beta0 + beta1*x2 + 0.2*x2 + rnorm(n, mean = 0, sd = sdu)
  model2 <- lm(y2 ~ x2)
  vbeta2[i] <- coef(model2)[[2]]
}
mean(vbeta2)
The above is a simple linear regression simulation with 10 million iterations, and I'm looking for help with speeding up the loop. Each iteration generates y2 <- beta0 + beta1*x2 + 0.2*x2 + rnorm(n, mean = 0, sd = sdu), fits the regression, and stores the slope estimate, which is then used to calculate the mean of vbeta2.
You could use profvis to determine where the processor time is spent:
library(profvis)
profvis({
  set.seed(155656494)
  # setting parameter values
  n <- 500
  sdu <- 25
  beta0 <- 40
  beta1 <- 12
  # Running the simulation again
  # create the x variable outside the loop since it's fixed in
  # repeated sampling
  x2 <- floor(runif(n, 5, 16))
  # set the number of iterations for your simulation (how many values
  # of beta1 will be estimated); reduced to 1000 for profiling
  nsim2 <- 1000
  # create a vector to store the estimated values of beta1
  vbeta2 <- numeric(nsim2)
  # create a loop that produces values of y, regresses y on x, and
  # stores the OLS estimate of beta1
  for (i in 1:nsim2) {
    y2 <- beta0 + beta1*x2 + 0.2*x2 + rnorm(n, mean = 0, sd = sdu)
    model2 <- lm(y2 ~ x2)
    vbeta2[i] <- coef(model2)[[2]]
  }
  mean(vbeta2)
})
The result shows that most of the time is spent evaluating the linear regression model (the lm() call).
As suggested by @maarvd, you could parallelize to speed this up. However, parallelizing each individual calculation won't be efficient because a single calculation is too fast (~0.5 ms), so you would have to distribute chunks of many thousands of calculations per worker. I agree with @Allan Cameron: is it worth the effort?
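If it is, one option avoids lm() entirely. This is a sketch under the assumption that only the slope estimate is needed: with x2 fixed, the OLS slope is sum((x2 - mean(x2)) * y2) / sum((x2 - mean(x2))^2), so each iteration reduces to drawing the noise and computing one weighted sum.

# Sketch: closed-form OLS slope instead of lm() (assumes only the slope is needed)
set.seed(155656494)
n <- 500; sdu <- 25; beta0 <- 40; beta1 <- 12
x2 <- floor(runif(n, 5, 16))
mu  <- beta0 + beta1 * x2 + 0.2 * x2   # fixed part of y2
xc  <- x2 - mean(x2)                   # centered x
sxx <- sum(xc^2)
nsim2 <- 1e5                           # scale up once it is fast enough
vbeta2 <- replicate(nsim2, {
  y2 <- mu + rnorm(n, mean = 0, sd = sdu)
  sum(xc * y2) / sxx                   # OLS slope, without the lm() overhead
})
mean(vbeta2)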

multiple linear regression: error in user defined function

I have written my own function for MLR (multiple linear regression). However, there seems to be an issue with the output (see the examples at the end).
When I run the code line by line, however, the output is correct.
mlr <- function(dependentvar, dataset) {
  x <- model.matrix(dependentvar ~ ., dataset)           # design matrix for x
  y <- dependentvar                                      # dependent variable
  betas <- solve(crossprod(x)) %*% crossprod(x, y)       # beta values
  SST <- t(y) %*% y - (sum(y)^2 / dim(dataset)[1])       # total sum of squares
  SSres <- t(y) %*% y - (t(betas) %*% crossprod(x, y))   # residual sum of squares
  SSreg <- SST - SSres                                   # regression sum of squares
  sigmasqr <- SSres / (length(y) - dim(dataset)[2])      # variance (MSE)
  varofbeta <- sigmasqr[1] * solve(crossprod(x))         # variance of beta
  cat("SST:", SST, "SSresiduals:", SSres, "SSregression:", SSreg, sep = "\n", append = FALSE)
  return(betas)
}
To see the problem, try:
mlr(trees$Height, trees)
I get the same problem even if I get rid of the $:
Height <- trees$Height
mlr(Height, trees)
Use the following:
x <- model.matrix(reformulate(".", dependentvar), dataset)
y <- dataset[[dependentvar]]
and pass in dependentvar as a string.
Example:
mlr("Height", trees)
