Iteration for Simple Linear Regression and Prediction in R

Using R, I would like to run n iterations for generating n simple linear regression models using a training dataset to predict n sets of fitted values for the test data. This would involve storing the models and predictions in appropriate containers. I would also like to label the predictions using a vector of labels. The data are structured as follows:
X = c(1.1,2.3,3.4,4.5,5.8) and Y = c(1.0,2.4,3.3,4.7,6.0); the model would be like lm(Y.train ~ X.train) and the predictions like predict(lm, newdata = test_set). How would I set this up? Is there a better approach?
Thanks!

Set up a list or two to store the fitted model and the predictions. Then loop.
In the code below, you'll need to fill in the ...'s since I don't know what your data look like or how you are deciding what will differ between each iteration.
model_list <- vector(mode = "list", length = N)
predict_list <- vector(mode = "list", length = N)
for (i in 1:N) {
  # fit the i-th model
  model_list[[i]] <- lm(y ~ x, data = subset(...))
  # store its predictions
  predict_list[[i]] <- predict(model_list[[i]], newdata = subset(...))
}
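For concreteness, here is a minimal runnable sketch using the toy vectors from the question; the train/test split rule and the number of iterations are made up for illustration:
X <- c(1.1, 2.3, 3.4, 4.5, 5.8)
Y <- c(1.0, 2.4, 3.3, 4.7, 6.0)
dat <- data.frame(x = X, y = Y)

N <- 3  # number of iterations (arbitrary here)
model_list <- vector(mode = "list", length = N)
predict_list <- vector(mode = "list", length = N)

for (i in 1:N) {
  train_idx <- sample(nrow(dat), 4)  # hypothetical split: 4 training rows
  train <- dat[train_idx, ]
  test <- dat[-train_idx, ]
  model_list[[i]] <- lm(y ~ x, data = train)
  predict_list[[i]] <- predict(model_list[[i]], newdata = test)
}

# label the predictions with a vector of labels, as the question asks
names(predict_list) <- paste0("model_", 1:N)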

Related

Simulating logistic regression from saved estimates in R

I have a bit of an issue. I am trying to develop some code that will allow me to do the following: 1) run a logistic regression analysis, 2) extract the estimates from the logistic regression analysis, and 3) use those estimates to create another logistic regression formula that I can use in a subsequent simulation of the original model. As I am relatively new to R, I understand I can extract these coefficients one by one through indexing, but it is difficult to scale this to models with different numbers of coefficients. I am wondering if there is a better way to extract the coefficients and set up the formula. Then, I would have to develop the actual variables, but the development of these variables would have to be flexible enough for any number of variables and distributions. This appears to be easily done in Mplus (example 12.7 in the Mplus manual), but I haven't figured this out in R. Here is the code for as far as I have gotten:
#generating the data
set.seed(1)
gender <- sample(c(0,1), size = 100, replace = TRUE)
age <- round(runif(100, 18, 80))
xb <- -9 + 3.5*gender + 0.2*age
p <- 1/(1 + exp(-xb))
y <- rbinom(n = 100, size = 1, prob = p)
#grabbing the coefficients from the logistic regression model
matrix_coef <- summary(glm(y ~ gender + age, family = "binomial"))$coefficients
the_estimates <- matrix_coef[,1]
the_estimates
the_estimates[1]
the_estimates[2]
the_estimates[3]
I just cannot seem to figure out how to have R create the formula with the variables (x's) and the coefficients from the original model in a flexible manner to accommodate any number of variables and different distributions. This is not a class assignment, but a necessary piece of the research I am producing. Any help will be greatly appreciated, and please, treat this as a teaching moment. I really want to learn this.
I'm not 100% sure what your question is here.
If you want to simulate new data from the same model with the same predictor variables, you can use the simulate() method:
dd <- data.frame(y, gender, age)
## best practice when modeling in R: take the variables from a data frame
model <- glm(y ~ gender + age, data = dd, family = "binomial")
simulate(model)
You can create multiple replicates by specifying the nsim= argument (or you can simulate anew every time through a for() loop).
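For example (a small sketch):
sims <- simulate(model, nsim = 1000)  # a data frame with one simulated response per column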
If you want to simulate new data from a different set of predictor variables, you have to do a little bit more work (some model types in R have a newdata= argument, but not GLMs alas):
## simulate new model matrix (including intercept)
simdat <- cbind(1,
                gender = rbinom(100, prob = 0.5, size = 1),
                age = sample(18:80, size = 100, replace = TRUE))
## extract inverse-link function
invlink <- family(model)$linkinv
## sample new values
resp <- rbinom(n = 100, size = 1, prob = invlink(simdat %*% coef(model)))
If you want to do this later from coefficients that have been stored, substitute the retrieved coefficient vector for coef(model) in the code above.
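For example, with the estimates saved earlier in the question as the_estimates:
resp <- rbinom(n = 100, size = 1, prob = invlink(simdat %*% the_estimates))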
If you want to flexibly construct formulas, reformulate() is your friend — but I don't see how it fits in here.
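For the record, a one-line illustration of reformulate():
reformulate(c("gender", "age"), response = "y")  # yields y ~ gender + age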
If you want to (say) re-fit the model 1000 times to new responses simulated from the original model fit (same coefficients, same predictors: i.e. a parametric bootstrap), you can do something like this.
nsim <- 1000
res <- matrix(NA, ncol = length(coef(model)), nrow = nsim)
for (i in 1:nsim) {
  ## simulate() returns a list (in this case, of length 1);
  ## extract the response vector
  newresp <- simulate(model)[[1]]
  newfit <- update(model, newresp ~ .)
  res[i, ] <- coef(newfit)
}
You don't have to store coefficients; you can extract or compute whatever model summaries you like (change the number of columns of res appropriately).
Let's say your data matrix of predictors (age, gender, or whatever else) is X, with an intercept column included. Then you can use X on the right-hand side of your glm formula, compute xb_hat <- X %*% the_estimates (or substitute any other data matrix for X as long as it has the same columns), and plug xb_hat into whatever inverse-link function you want.
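A sketch of that with the objects from the question (the intercept column must be supplied by hand):
X <- cbind(1, gender, age)     # predictor matrix with intercept column
xb_hat <- X %*% the_estimates  # linear predictor
p_hat <- 1/(1 + exp(-xb_hat))  # inverse logit link
y_sim <- rbinom(n = 100, size = 1, prob = p_hat)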

Predicting new values in jags (mixed model)

I asked a similar question a while ago on how to get model predictions in JAGS for mixed models. Here's my original question.
This time, I'm trying to get predictions for the same model but using new data and not the original that was used to fit the model.
model<-"model {
# Priors
mu_int~dnorm(0, 0.0001)
sigma_int~dunif(0, 100)
tau_int <- 1/(sigma_int*sigma_int)
for (j in 1:(M)){
alpha[j] ~ dnorm(mu_int, tau_int)
}
beta~dnorm(0, 0.01)
sigma_res~dunif(0, 100)
tau_res <- 1/(sigma_res*sigma_res)
# Likelihood
for (i in 1:n) {
mu[i] <- alpha[Mat[i]]+beta*Temp[i] # Expectation
D47[i]~dnorm(mu[i], tau_res) # The actual (random) responses
}
for(i in 1:(n)){
D47_pred[i] <- dnorm(mu[i], tau_res)
}
}"
I know this can be done using the posterior distributions of the resulting parameters, but I'm wondering if it could also be implemented inside JAGS.
Thank you!
It absolutely could be done inside JAGS. If you wanted predictions for new values of Temp for some of the same observations in Mat, you would just have to append them to the existing data with a corresponding D47 value of NA.
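A minimal sketch of that data preparation in R (Temp_new and Mat_new are hypothetical vectors of new predictor values; the names are made up):
Temp_all <- c(Temp, Temp_new)
Mat_all <- c(Mat, Mat_new)
D47_all <- c(D47, rep(NA, length(Temp_new)))  # NA responses are the ones JAGS predicts
## pass n = length(D47_all) to the model and monitor D47 (or D47_pred) for the appended slots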

Create model.matrix() from LASSO output

I wish to create a model matrix of the independent variables/specific levels of categorical variables selected by LASSO so that I can plug said model matrix into a glm() function to run a logistic regression.
I have included an example of what I'm trying to do. Any help would be greatly appreciated
data("iris")
iris$Petal.Width <- factor(iris$Petal.Width)
iris$Sepal.Length2 <- ifelse(iris$Sepal.Length>=5.8,1,0)
f <- as.formula(Sepal.Length2~Sepal.Width+Petal.Length+Petal.Width+Species)
X <- model.matrix(f,iris)[,-1]
Y <- iris$Sepal.Length2
cvfit <- cv.glmnet(X,Y,alpha=1,family="binomial")
fit <- glmnet(X,Y,alpha=1,family = "binomial")
b <- coef(cvfit,s="lambda.1se")
print(b)
## This is the part I am unsure of: I want to create a model matrix of the non-zero coefficients contained within 'b'
## e.g.
lasso_x <- model.matrix(b,iris)
logistic_model <- glm.fit(lasso_x,Y,family = "binomial")
Edit:
I also tried the following:
model.matrix(~X)[which(b!=0)-1]
but it just gives me a single column of 1's, the length of the number of selections from LASSO (minus the intercept)
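One possible approach (a sketch, not from the original post): coef() on a glmnet fit returns a sparse matrix, so you can take the names of the non-zero rows, drop the intercept, and subset the columns of the existing model matrix X. (As an aside, model.matrix(~X)[which(b!=0)-1] returns 1's because single-bracket indexing with a single index treats the matrix as a flat vector, and its first entries are the intercept column.)
nz <- rownames(b)[as.vector(b) != 0]
nz <- setdiff(nz, "(Intercept)")
lasso_x <- X[, nz, drop = FALSE]
logistic_model <- glm(Y ~ ., data = data.frame(lasso_x, Y, check.names = FALSE),
                      family = binomial)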

Get list of R-squared values for linear regression model as we incrementally add predictors

I have a regression that predicts y based on 14 x-values (x1 through x14). I want to write a loop that does a regression where each iteration of the loop adds one more predictor to the regression, then tells me what the r-squared is. Here is my code:
rsqvals <- rep(NA, 15)
for (i in 1:15) {
  simtemp2 <- simdata[, 1:i]
  modeL <- lm(y ~ ., data = simtemp2)
  rsqvals[i] <- summary(modeL)$r.squared
}
where simdata is my data frame and simtemp2 is the columns I want. I suspect the problem has something to do with the fact that I can't type simdata[, 1:i], but I'm not sure why not. Any help appreciated!
It looks like you are subsetting the data.frame too far on the first iteration: simdata[, 1:1] returns a plain vector rather than a data.frame, and lm() will not accept a vector as its data= argument. Try starting at 2 and see if this works:
rsqvals <- rep(NA, 15)
### intercept-only model: no features; this isn't statistically meaningful,
### but I put it here for completeness
interceptonly <- lm(y ~ 1, data = simdata)
rsqvals[1] <- summary(interceptonly)$r.squared
for (i in 2:15) {
  simtemp2 <- simdata[, 1:i]
  modeL <- lm(y ~ ., data = simtemp2)
  rsqvals[i] <- summary(modeL)$r.squared
}
print(rsqvals)
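As an aside, the same computation can be written more compactly (a sketch, assuming column 1 of simdata is y and columns 2:15 are x1 through x14):
rsqvals <- c(summary(lm(y ~ 1, data = simdata))$r.squared,
             sapply(2:15, function(i) summary(lm(y ~ ., data = simdata[, 1:i]))$r.squared))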

Multivariate regression with glm: logical subscript too long

I am teaching myself multivariate regression and I am trying to simulate a multivariate random variable and construct a generalized linear model to fit it.
Here is my code:
# Clear previous
rm(list = ls())

cmp <- 2  # number of components in sample
n <- 10   # number of simulated data points
B <- matrix(c(1, 2, 3, 4), nrow = 2, byrow = TRUE)  # coefficient matrix

# Simulate model
X <- matrix(rep(0, 2*n), nrow = 2, byrow = TRUE)  # initiate independent matrix
Y <- matrix(rep(0, 2*n), nrow = 2, byrow = TRUE)  # initiate response matrix
for (j in 1:cmp) {
  X[j, ] <- rnorm(n)  # independent data
  e <- rnorm(n)       # error term
  Y[j, ] <- B[j, 1] + B[j, 2]*X[j, ] + e
}

# Linear regression
fit <- glm(Y ~ X, family = gaussian())
fit
This produces the following error in the function glm:
Error in x[good, , drop = FALSE] : (subscript) logical subscript too long
I am quite unsure what the problem is.
Multivariate GLM
glm() does not work with multiple dependent variables. You can regress a single response column, as in the code below, but not both at once; it is the independent data that can be multivariate.
Use Y[1,] instead of Y:
fit = glm(Y[1,] ~ t(X), family = gaussian())
In addition, the line above uses the transpose t(X) instead of X, because glm() interprets the rows as separate observations.
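If you want a fit for every component rather than just the first, one option is to loop over the rows (a sketch using the simulation above):
fits <- lapply(1:cmp, function(j) glm(Y[j, ] ~ X[j, ], family = gaussian()))
lapply(fits, coef)  # one intercept/slope pair per component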
MANOVA/MANCOVA / linear discriminant analysis
In your case, you seem to be using Gaussian-distributed errors. For this particular case, there is a method that handles multiple dependent variables: MANOVA (if the independent variable is a factor) or MANCOVA (if the independent variable is continuous). You can model it in R as fit = manova(t(Y) ~ t(X)).
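Relatedly, lm() accepts a matrix response directly and fits one least-squares regression per column, which is equivalent for Gaussian errors:
fit <- lm(t(Y) ~ t(X))
coef(fit)  # one column of coefficients per response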
