Extracting model reliability from multiple GAMs applied across a dataframe - r

I have been able to apply a Generalized Additive Model iteratively across a dataframe, where sp_a is the response variable...
sp_a <- rnorm(100, mean = 3, sd = 0.9)
var_env_1 <- rnorm(100, mean = 1, sd = 0.3)
var_env_2 <- rnorm(100, mean = 5, sd = 1.6)
var_env_3 <- rnorm(100, mean = 10, sd = 1.2)
data <- data.frame(sp_a, var_env_1, var_env_2, var_env_3)
library(mgcv)
Gam <- lapply(data[,-1], function(x) summary(gam(data$sp_a ~ s(x))))
This fits a GAM between the response variable and each explanatory variable in turn. However, how would I then extract the p-values (the s.pv component) from each model? Does anybody know how to do this? Also, it would be great to rank the models by their AIC score, like this...
Gam1 <- gam(sp_a ~ s(var_env_1))
Gam2 <- gam(sp_a ~ s(var_env_2))
Gam3 <- gam(sp_a ~ s(var_env_3))
AIC(Gam1,Gam2,Gam3)
But I'd like to obtain this from the original 'Gam' output instead. Thank you in advance for any help.

In the end, it was evident I had to remove the summary() call, which then allowed me to calculate the AIC score for all models. Other interesting ways of formatting can be found in Using lapply on a list of models, as these functions work for different kinds of models (e.g. lm, glm).
Gam <- lapply(data[,-1], function(x) gam(data$sp_a ~ s(x)))
sapply(X = Gam, FUN = AIC)
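For the p-values, the same pattern works: each fitted model's summary() carries the smooth-term p-values in its s.pv component. A minimal sketch, reusing the un-summarised Gam list above:

pvals <- sapply(Gam, function(m) summary(m)$s.pv)  ## p-value of each s(x) term
aics <- sapply(Gam, FUN = AIC)
sort(aics)  ## models ranked from lowest (best) AIC upwards
data.frame(p.value = pvals, AIC = aics)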

Related

Simulating logistic regression from saved estimates in R

I have a bit of an issue. I am trying to develop code that will allow me to: 1) run a logistic regression analysis, 2) extract the estimates from that analysis, and 3) use those estimates to build another logistic regression formula for a subsequent simulation of the original model. As I am relatively new to R, I understand I can extract these coefficients one by one through indexing, but it is difficult to scale this to models with different numbers of coefficients. I am wondering if there is a better way to extract the coefficients and set up the formula. I would then have to generate the actual variables, and that step would have to be flexible enough for any number of variables and distributions. This appears to be easily done in Mplus (example 12.7 in the Mplus manual), but I haven't figured it out in R. Here is the code for as far as I have gotten:
#generating the data
set.seed(1)
gender <- sample(c(0,1), size = 100, replace = TRUE)
age <- round(runif(100, 18, 80))
xb <- -9 + 3.5*gender + 0.2*age
p <- 1/(1 + exp(-xb))
y <- rbinom(n = 100, size = 1, prob = p)
#grabbing the coefficients from the logistic regression model
matrix_coef <- summary(glm(y ~ gender + age, family = "binomial"))$coefficients
the_estimates <- matrix_coef[,1]
the_estimates
the_estimates[1]
the_estimates[2]
the_estimates[3]
I just cannot seem to figure out how to have R create the formula with the variables (x's) and the coefficients from the original model in a flexible manner that accommodates any number of variables and distributions. This is not a class assignment, but a necessary piece of the research I am producing. Any help will be greatly appreciated, and please treat this as a teaching moment. I really want to learn this.
I'm not 100% sure what your question is here.
If you want to simulate new data from the same model with the same predictor variables, you can use the simulate() method:
dd <- data.frame(y, gender, age)
## best practice when modeling in R: take the variables from a data frame
model <- glm(y ~ gender + age, data = dd, family = "binomial")
simulate(model)
You can create multiple replicates by specifying the nsim= argument (or you can simulate anew each time through a for() loop).
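For example (a small illustration; simulate() returns a data frame with one column per replicate):

sims <- simulate(model, nsim = 10)  ## columns sim_1 ... sim_10
head(sims[["sim_1"]])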
If you want to simulate new data from a different set of predictor variables, you have to do a little bit more work (some model types in R have a newdata= argument, but not GLMs alas):
## simulate new model matrix (including intercept)
simdat <- cbind(1,
                gender = rbinom(100, prob = 0.5, size = 1),
                age = sample(18:80, size = 100, replace = TRUE))
## extract inverse-link function
invlink <- family(model)$linkinv
## sample new values
resp <- rbinom(n = 100, size = 1, prob = invlink(simdat %*% coef(model)))
If you want to do this later from coefficients that have been stored, substitute the retrieved coefficient vector for coef(model) in the code above.
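For instance (stored_coefs here is a hypothetical previously saved copy of the estimates):

stored_coefs <- coef(model)  ## hypothetical: estimates retrieved from storage
resp <- rbinom(n = 100, size = 1, prob = invlink(simdat %*% stored_coefs))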
If you want to flexibly construct formulas, reformulate() is your friend — but I don't see how it fits in here.
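For completeness, a quick sketch of what reformulate() does:

## builds y ~ gender + age from character vectors
form <- reformulate(termlabels = c("gender", "age"), response = "y")
model2 <- glm(form, data = dd, family = "binomial")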
If you want to (say) re-fit the model 1000 times to new responses simulated from the original model fit (same coefficients, same predictors: i.e. a parametric bootstrap), you can do something like this.
nsim <- 1000
res <- matrix(NA, ncol = length(coef(model)), nrow = nsim)
for (i in 1:nsim) {
    ## simulate() returns a list (in this case, of length 1);
    ## extract the response vector
    newresp <- simulate(model)[[1]]
    newfit <- update(model, newresp ~ .)
    res[i, ] <- coef(newfit)
}
You don't have to store coefficients; you can extract/compute whatever model summaries you like (changing the number of columns of res appropriately).
Let's say your data matrix, including age and gender or whatever predictors, is X. Then you can use X on the right-hand side of your glm formula, compute xb_hat <- X %*% the_estimates (or any other data matrix in place of X, as long as it has the same columns), and plug xb_hat into whatever inverse-link function you want.
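A minimal sketch of that idea, assuming the logit link used above (plogis() is the inverse logit, and model.matrix() adds the intercept column automatically):

X <- model.matrix(~ gender + age)  ## design matrix with intercept column
xb_hat <- X %*% the_estimates      ## linear predictor
p_hat <- plogis(xb_hat)            ## inverse-link (logit) transformation
y_sim <- rbinom(n = nrow(X), size = 1, prob = p_hat)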

Is there a way in R to create a linear combination of coefficients from **different** lm models?

Unfortunately, I can't provide the data I'm working with, but I don't think it's necessary to solve the problem. My question is, I think, the same one this guy had 5 years ago (R: testing linear combination of coefficients from multiple regressions with plm), but back then it went unanswered, so I'll try my luck.
Essentially, as the title says, I want to test a linear combination of coefficients. This sounds very straightforward, and indeed there are several packages for it (glht, lincom, etc.). The only problem is that all of them take a single lm (or glm, etc.) model as one of their arguments, whereas I am estimating several models and want to compute point estimates, standard errors, etc. for a linear combination of coefficients from different, previously estimated models.
In Stata, this is done with a simple trick: first run the suest command, which takes model names as arguments and combines them into a single object, and then run lincom as one would in R. I have been searching the internet for a while and couldn't find anything; do you know how to achieve this in R?
Here is a more concrete example of what I want to do, I hope it helps.
library(data.table)
library(estimatr)
library(foreign)
library(multcomp)
df <- data.table(y1 = runif(100, 0, 100),
                 y2 = runif(100, 0, 100),
                 y3 = runif(100, 0, 100),
                 x = runif(100, 0, 100),
                 z = runif(100, 0, 100),
                 id = round(runif(100, 0, 100)/10)*10)
lm1 <- lm_robust(formula = y1 ~ x + z,
                 data = df,
                 subset = x > 10 & x < 90,
                 clusters = id)
lm2 <- lm_robust(formula = y2 ~ x + z,
                 data = df,
                 subset = x > 10 & x < 90,
                 clusters = id)
lm3 <- lm_robust(formula = y3 ~ x + z,
                 data = df,
                 subset = x > 10 & x < 90,
                 clusters = id)
# linear combination I can for example do
summary(glht(lm1, linfct = "x + z = 0"))
# linear combination I would like to do
# 0.1*z_lm1 + 0.4*z_lm2 + 0.2*z_lm3 = 0, where the numbers before the coefficients are weights defined somewhere else
I have now read in the glht documentation that, by passing the function a list of coefficients and a covariance matrix, it should be able to ignore the "model" argument and thus let me compute a linear combination of coefficients from different models. However, I still have no idea how one would go about computing a covariance matrix using results from different models.
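One possible sketch, not a full answer: multcomp::parm() lets glht() run from a bare coefficient vector and covariance matrix, so you can stack the pieces yourself. The catch is exactly the covariance question; the block-diagonal form below assumes zero covariance across models, which is generally not true when the models share the same data, so treat it as a starting point only:

library(multcomp)  ## already loaded above
## coefficients of z from each fit
beta_all <- c(coef(lm1)["z"], coef(lm2)["z"], coef(lm3)["z"])
## ASSUMPTION: zero cross-model covariance (block-diagonal vcov)
V <- diag(c(vcov(lm1)["z", "z"], vcov(lm2)["z", "z"], vcov(lm3)["z", "z"]))
w <- matrix(c(0.1, 0.4, 0.2), nrow = 1)  ## the weights defined elsewhere
summary(glht(model = parm(beta_all, V), linfct = w))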

Using coeftest results in predict.lm()

I am analyzing a dataset in which the variance of the error term in the regression is not constant for all observations. For this, I re-built the model, estimating heteroskedasticity-robust (Huber-White) standard errors using the coeftest function. Now, I want to use these new results for a prediction with predict() function.
The dataset looks like the following but with multiple X:
set.seed(123)
x <- rep(c(10, 15, 20, 25), each = 25)
e <- c()
e[1:25] <- rnorm(25, sd = 10)
e[26:50] <- rnorm(25, sd = 15)
e[51:75] <- rnorm(25, sd = 20)
e[76:100] <- rnorm(25, sd = 25)
y <- 720 - 3.3 * x + e
model <- lm(y ~ x)
library(lmtest)
library(sandwich)
coeftest(model, vcov=vcovHC(model, "HC1"))
I found the following solution for the issue on the internet:
predict.rob <- function(x, vcov, newdata) {
    if (missing(newdata)) { newdata <- x$model }
    tt <- terms(x)
    Terms <- delete.response(tt)
    m.mat <- model.matrix(Terms, data = newdata)
    m.coef <- x$coef
    fit <- as.vector(m.mat %*% x$coef)
    se.fit <- sqrt(diag(m.mat %*% vcov %*% t(m.mat)))
    return(list(fit = fit, se.fit = se.fit))
}
The remaining problem is that my regression has more than one regressor.
Is there any way to adapt this solution to multiple (7) explanatory variables?
Thanks in advance!
I'm not sure, but the coeftest function only performs a test; you can't use its result directly for prediction. Maybe you can somehow pass the covariance matrix vcovHC(model, "HC1") to the prediction step. I hope it helps a bit.
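For what it's worth, the predict.rob() function above should already handle multiple regressors, since model.matrix() builds the full design matrix from the model's terms. A sketch with a second, hypothetical regressor x2 added to the earlier example:

x2 <- runif(100, 0, 10)  ## hypothetical extra regressor
model2 <- lm(y ~ x + x2)
predict.rob(model2, vcov = vcovHC(model2, "HC1"),
            newdata = data.frame(x = 15, x2 = 5))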

Obtaining Standardized coefficients from "rstanarm" package in R?

I was wondering if it might be possible (and perhaps recommended) to obtain standardized coefficients from stan_glm() in the rstanarm package? (did not find anything specific in the documentation)
Can I just standardize all variables as in normal regression? (see below)
Example:
library("rstanarm")
fit <- stan_glm(wt ~ vs*gear, data = mtcars)
Standardization:
design <- wt ~ vs*gear
vars <- all.vars(design)
stand.vars <- as.data.frame(lapply(mtcars[, vars], scale))
fit <- stan_glm(design, data = stand.vars)
I would not say that it is affirmatively recommended, but I would recommend that you not subtract the sample mean and divide by the sample standard deviation of the outcome because the estimation uncertainty in those two statistics will not be propagated to the posterior distribution.
Standardizing the predictors is more debatable. You can do it, but it makes doing posterior prediction with new data harder because you have to remember to subtract the old means from the new data and divide by the old standard deviations.
The most computationally efficient approach is to leave the variables as they are but specify the non-default argument QR = TRUE, especially if you are not going to modify the default (normal) priors on the coefficients anyway.
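For example, re-fitting the model above with the QR decomposition turned on:

fit <- stan_glm(wt ~ vs*gear, data = mtcars, QR = TRUE)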
You can then standardize the posterior coefficients after-the-fact if standardized coefficients are of interest. To do so, you can do
X <- model.matrix(fit)
sd_X <- apply(X, MARGIN = 2, FUN = sd)[-1]
sd_Y <- apply(posterior_predict(fit), MARGIN = 1, FUN = sd)
beta <- as.matrix(fit)[ , 2:ncol(X), drop = FALSE]
b <- sweep(sweep(beta, MARGIN = 2, STATS = sd_X, FUN = `*`),
           MARGIN = 1, STATS = sd_Y, FUN = `/`)
summary(b)
However, standardizing regression coefficients just gives the illusion of comparability across variables and says nothing about how germane a one standard deviation difference is, particularly for dummy variables. If your question is really whether manipulating this predictor or that predictor is going to make a bigger difference on the outcome variable, then simply simulate those manipulations like
PPD_0 <- posterior_predict(fit)
nd <- model.frame(fit)
nd[ , 2] <- nd[ , 2] + 1 # for example
PPD_1 <- posterior_predict(fit, newdata = nd)
summary(c(PPD_1 - PPD_0))
and repeat that process for other manipulations of interest.
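For instance, a sketch of the analogous comparison for gear (by analogy with the column-2 shift above):

nd2 <- model.frame(fit)
nd2$gear <- nd2$gear + 1  ## shift gear instead
PPD_2 <- posterior_predict(fit, newdata = nd2)
summary(c(PPD_2 - PPD_0))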

Plotting binomial glm with interactions in numeric variables

I want to know if it is possible to plot a binomial GLM with interactions between numeric variables. In my case:
## artificial data set
set.seed(20)
d <- data.frame(
    mating = sample(0:1, 200, replace = TRUE),
    behv = scale(rpois(200, 10)),
    condition = scale(rnorm(200, 5))
)
# fitted binomial GLM
model <- glm(mating ~ behv + condition, data = d, family = binomial)
summary(model)
In a situation where behv and condition are significant in the model
# plotting first for behv
x <- d$behv                                   ## take behv values
x2 <- rep(mean(d$condition), length(d[,1]))   ## fix condition at its mean
# points
plot(d$mating ~ d$behv)
# curve
curve(exp(model$coefficients[1] + model$coefficients[2]*x + model$coefficients[3]*x2) /
      (1 + exp(model$coefficients[1] + model$coefficients[2]*x + model$coefficients[3]*x2)))
But it doesn't work!! Is there another, correct approach?
Thanks
It seems like your desired output is a plot of the conditional means (or best-fit line). You can do this by computing predicted values with the predict function.
I'm going to change your example a bit, to get a nicer looking result.
d$mating <- ifelse(d$behv > 0, rbinom(200, 1, .8), rbinom(200, 1, .2))
model <- glm(mating ~ behv + condition, data = d, family = binomial)
summary(model)
Now, we make a newdata dataframe with your desired values:
newdata <- d
newdata$condition <- mean(newdata$condition)
newdata$yhat <- predict(model, newdata, type = "response")
Finally, we sort newdata by the x-axis variable (if not, we'll get lines that zig-zag all over the plot), and then plot:
newdata <- newdata[order(newdata$behv), ]
plot(newdata$mating ~ newdata$behv)
lines(x = newdata$behv, y = newdata$yhat)
Output: a scatter of the 0/1 mating values against behv, with the fitted probability curve overlaid.
