How to repeat univariate regression and extract P values? - r

I am using lapply to perform several glm regressions, modeling one dependent variable against one independent variable at a time, but I'm not sure how to extract all of the P values at once.
There are 200 features in my dataset, but the code below only gave me the P value of feature#1. How can I get a matrix of all P values of the 200 features?
valName<- as.data.frame(colnames(repeatData))
featureName<-valName[3,]
lapply(featureName,
       function(var) {
         formula <- as.formula(paste("outcome ~", var))
         fit.logist <- glm(formula, data = repeatData, family = binomial)
         summary(fit.logist)
         Pvalue <- coef(summary(fit.logist))[, 'Pr(>|z|)']
       })

I simplified your code a little bit: (1) used reformulate() (not really different, just prettier); (2) returned only the p-value for the focal variable (not the intercept p-value). (If you leave out the 2 in the row index, you'll get a 2-row matrix with the intercept and focal-variable p-values.)
My example uses the built-in mtcars data set, with an added (fake) binomial response.
repeatData <- data.frame(outcome=rbinom(nrow(mtcars), size=1, prob=0.5), mtcars)
ff <- function(var) {
  formula <- reformulate(var, response = "outcome")
  fit.logist <- glm(formula, data = repeatData, family = binomial)
  coef(summary(fit.logist))[2, 'Pr(>|z|)']
}
## skip first column (response variable).
sapply(names(repeatData)[-1], ff)
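sapply() here returns a named vector of p-values. If you specifically want a matrix of the 200 p-values (as in your question), a trivial wrapper in base R (not part of the original answer, just one option) is:
pvals <- sapply(names(repeatData)[-1], ff)
as.matrix(pvals)   ## one-column matrix, one row per feature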

Related

Simulating logistic regression from saved estimates in R

I have a bit of an issue. I am trying to develop some code that will allow me to do the following: 1) run a logistic regression analysis, 2) extract the estimates from the logistic regression analysis, and 3) use those estimates to create another logistic regression formula that I can use in a subsequent simulation of the original model. As I am relatively new to R, I understand I can extract these coefficients one by one through indexing, but it is difficult to "scale" this to models with different numbers of coefficients. I am wondering if there is a better way to extract the coefficients and set up the formula. Then, I would have to develop the actual variables, but the development of these variables would have to be flexible enough for any number of variables and distributions. This appears to be easily done in Mplus (example 12.7 in the Mplus manual), but I haven't figured this out in R. Here is the code for as far as I have gotten:
#generating the data
set.seed(1)
gender <- sample(c(0,1), size = 100, replace = TRUE)
age <- round(runif(100, 18, 80))
xb <- -9 + 3.5*gender + 0.2*age
p <- 1/(1 + exp(-xb))
y <- rbinom(n = 100, size = 1, prob = p)
#grabbing the coefficients from the logistic regression model
matrix_coef <- summary(glm(y ~ gender + age, family = "binomial"))$coefficients
the_estimates <- matrix_coef[,1]
the_estimates
the_estimates[1]
the_estimates[2]
the_estimates[3]
I just cannot seem to figure out how to have R create the formula with the variables (x's) and the coefficients from the original model in a flexible manner, to accommodate any number of variables and different distributions. This is not a class assignment, but a necessary piece of the research that I am producing. Any help will be greatly appreciated, and please, treat this as a teaching moment. I really want to learn this.
I'm not 100% sure what your question is here.
If you want to simulate new data from the same model with the same predictor variables, you can use the simulate() method:
dd <- data.frame(y, gender, age)
## best practice when modeling in R: take the variables from a data frame
model <- glm(y ~ gender + age, data = dd, family = "binomial")
simulate(model)
You can create multiple replicates by specifying the nsim= argument (or you can simulate anew every time through a for() loop).
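For example, a small sketch (the value 10 is arbitrary); simulate() on a fitted glm returns a data frame with one column per replicate:
sims <- simulate(model, nsim = 10)
dim(sims)   ## 100 rows (observations) by 10 columns (replicates)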
If you want to simulate new data from a different set of predictor variables, you have to do a little bit more work (some model types in R have a newdata= argument, but not GLMs alas):
## simulate new model matrix (including intercept)
simdat <- cbind(1,
                gender = rbinom(100, prob = 0.5, size = 1),
                age = sample(18:80, size = 100, replace = TRUE))
## extract inverse-link function
invlink <- family(model)$linkinv
## sample new values
resp <- rbinom(n = 100, size = 1, prob = invlink(simdat %*% coef(model)))
If you want to do this later from coefficients that have been stored, substitute the retrieved coefficient vector for coef(model) in the code above.
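For instance, if the coefficients were saved earlier as the_estimates (as in the code in your question), the same line should work with the stored vector in place of coef(model):
resp2 <- rbinom(n = 100, size = 1, prob = invlink(simdat %*% the_estimates))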
If you want to flexibly construct formulas, reformulate() is your friend — but I don't see how it fits in here.
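For completeness, a tiny illustration of reformulate() with the variable names from this example:
reformulate(c("gender", "age"), response = "y")
## y ~ gender + age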
If you want to (say) re-fit the model 1000 times to new responses simulated from the original model fit (same coefficients, same predictors: i.e. a parametric bootstrap), you can do something like this.
nsim <- 1000
res <- matrix(NA, ncol = length(coef(model)), nrow = nsim)
for (i in 1:nsim) {
  ## simulate() returns a list (in this case, of length 1);
  ## extract the response vector
  newresp <- simulate(model)[[1]]
  newfit <- update(model, newresp ~ .)
  res[i, ] <- coef(newfit)
}
You don't have to store coefficients - you can extract/compute whatever model summaries you like (change the number of columns of res appropriately).
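For example (a sketch, assuming a model whose coefficient table has a 'Pr(>|z|)' column, as a binomial GLM does), you could store the Wald p-values instead by replacing the last line inside the loop with:
res[i, ] <- coef(summary(newfit))[, "Pr(>|z|)"]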
Let's say your data matrix including age and gender, or whatever predictors, is X. Then you can use X on the right-hand side of your glm formula, compute xb_hat <- X %*% the_estimates (or use any other data matrix in place of X, as long as it has the same columns), and plug xb_hat into whatever inverse-link function you want.
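As a concrete sketch with the variables from your example (the column order of X must match the order of the_estimates, and X needs an intercept column because the fitted model has one):
X <- model.matrix(~ gender + age)                  # intercept, gender, age
xb_hat <- X %*% the_estimates                      # linear predictor
p_hat <- plogis(xb_hat)                            # inverse logit link
y_sim <- rbinom(nrow(X), size = 1, prob = p_hat)   # simulated binary outcomes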

Description of columns in `augment.merMod()`

I appreciate broom.mixed's ability to capture mixed-effects modeling output in nice tidy formats. In assessing assumptions for the linear mixed-effects model, I am finding the augment function particularly useful. However, the documentation fails to state what all the columns are for augment.merMod().
library(lme4)
library(broom.mixed)
set.seed(101)
dd <- expand.grid(f1 = factor(1:3),
                  f2 = LETTERS[1:2], g = 1:9, rep = 1:15,
                  KEEP.OUT.ATTRS = FALSE)
summary(mu <- 5*(-4 + with(dd, as.integer(f1) + 4*as.numeric(f2))))
dd$y <- rnbinom(nrow(dd), mu = mu, size = 0.5)
m.nb <- glmer.nb(y ~ f1*f2 + (1|g), data=dd, verbose=FALSE)
head(augment(m.nb))
Here is what the documentation says:
augment returns one row for each original observation, with columns (each prepended by a .) added. Included are the columns
.fitted predicted values
.resid residuals
.fixed predicted values with no random effects
Also added for "merMod" objects, but not for "mer" objects, are values from the response object within the model (of type lmResp, glmResp, nlsResp, etc). These include ".mu", ".offset", ".sqrtXwt", ".sqrtrwt", ".eta".
What are these columns: ".mu", ".sqrtXwt", ".sqrtrwt", ".eta"? Is .fitted the predicted values on the model (link) scale, and .mu the predicted values on the response scale (in other words, with the inverse link function applied)?

Error: Variable length differs in lm regression using paste function

I have generated randomly a dataset that has been split in two (L and I).
First I run the regression on L using all the covariates.
After defining the set of variables that are significantly different from zero, I want to run the regression on I using this set of variables.
reg_L = lm(y ~ ., data = data)
S_hat = as.data.frame(round(summary(reg_L)$coefficients[,"Pr(>|t|)"], 3)<0.05)
S_hat_L = rownames(which(S_hat==TRUE, arr.ind = TRUE))
Here I want to run the new model, but it doesn't work, apparently only because of a problem in how the variable x is specified.
What am I doing wrong?
# Using the I proportion to construct the p-values
x = noquote(paste(S_hat_L, collapse = " + "))
reg_I = lm(y ~ x, data = data)
summary(reg_I)
A simpler way than trying to manipulate a formula programmatically would be to remove the unwanted predictors from the data:
pvals <- summary(reg_L)$coefficients[-1, "Pr(>|t|)"]   # drop the intercept row
wanted <- names(pvals)[pvals < 0.05]                   # names of significant predictors
reduced.data <- data[, c("y", wanted)]                 # keep the response as well
reg_S <- lm(y ~ ., data = reduced.data)
Note, however, that it is more robust with respect to out-of-sample performance to select variables with the LASSO. This will yield a model that has some coefficients set to zero, but the other coefficients are adjusted in such a way that the out-of-sample performance will be better.
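A minimal sketch of that approach using the glmnet package (not part of the original question, so treat the data layout as an assumption: a numeric response y and numeric covariates in data):
library(glmnet)
X <- model.matrix(y ~ ., data = data)[, -1]   # predictor matrix, intercept column dropped
cvfit <- cv.glmnet(X, data$y)                 # cross-validated LASSO (gaussian by default)
coef(cvfit, s = "lambda.min")                 # some coefficients are set exactly to zero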

Getting standardized coefficients for a glmer model?

I've been asked to provide standardized coefficients for a glmer model, but am not sure how to obtain them. Unfortunately, the beta function does not work on glmer models:
Error in UseMethod("beta") :
no applicable method for 'beta' applied to an object of class "c('glmerMod', 'merMod')"
Are there other functions I could use, or would I have to write one myself?
Another problem is that the model contains several continuous predictors (which operate on similar scales) and 2 categorical predictors (one with 4 levels, one with six levels). The purpose of using the standardized coefficients would be to compare the impact of the categorical predictors to those of the continuous ones, and I'm not sure that standardized coefficients are the appropriate way to do so. Are standardized coefficients an acceptable approach?
The model is as follows:
model <- glmer(cbind(nr_corr, maximum - nr_corr) ~ (condition | SUBJECT) +
                 categorical_1 + categorical_2 +
                 continuous_1 + continuous_2 + continuous_3 + continuous_4 +
                 categorical_1:categorical_2 + categorical_1:continuous_3,
               data,
               control = glmerControl(optimizer = "bobyqa",
                                      optCtrl = list(maxfun = 100000)),
               family = binomial)
reghelper::beta simply standardizes the numeric variables in your dataset. So, assuming your categorical variables are factors rather than numeric dummy variables or other contrast encodings, we can fairly simply standardize the numeric variables in the dataset:
## names (not positions) of the continuous predictors in the model formula
vars <- grep('^continuous', all.vars(formula(model)), value = TRUE)
f <- function(var, data)
  scale(data[[var]])
data[, vars] <- lapply(vars, f, data = data)
update(model, data = data)
Now, for the more general case, we can more or less just as easily create our own beta.merMod function. However, we will need to take into account whether or not it makes sense to standardize y. For example, if we have a Poisson model, only positive integer values make sense. In addition, a question becomes whether or not to scale the random-slope effects, and whether it makes sense to ask this question in the first place. In the function below I assume that categorical variables are encoded as character or factor and not as numeric or integer.
beta.merMod <- function(model,
                        x = TRUE,
                        y = !family(model)$family %in% c('binomial', 'poisson'),
                        ran_eff = FALSE,
                        skip = NULL,
                        ...){
  # Extract all names from the model formula
  vars <- all.vars(form <- formula(model))
  lhs <- all.vars(form[[2]])
  # Get the random-effect grouping variables from the model
  ranef <- names(ranef(model))
  # Remove ranef and lhs from vars
  rhs <- vars[!vars %in% c(lhs, ranef)]
  # extract the data used for the model
  env <- environment(form)
  call <- getCall(model)
  data <- get(dname <- as.character(call$data), envir = env)
  # standardize the dataset
  vars <- character()
  if(isTRUE(x))
    vars <- c(vars, rhs)
  if(isTRUE(y))
    vars <- c(vars, lhs)
  if(isTRUE(ran_eff))
    vars <- c(vars, ranef)
  data[, vars] <- lapply(vars, function(var){
    if(is.numeric(data[[var]]))
      data[[var]] <- scale(data[[var]])
    data[[var]]
  })
  # Update the model and change the data into the new data.
  update(model, data = data)
}
The function works for both linear and generalized linear mixed-effects models (not tested for nonlinear models), and is used just like other beta functions from reghelper:
library(reghelper)
library(lme4)
# Linear mixed effect model
fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
fm2 <- beta(fm1)
fixef(fm1) - fixef(fm2)
(Intercept) Days
-47.10279 -19.68157
# Generalized mixed effect model
data(cbpp)
# create numeric variable correlated with period
cbpp$nv <-
rnorm(nrow(cbpp), mean = as.numeric(levels(cbpp$period))[as.numeric(cbpp$period)])
gm1 <- glmer(cbind(incidence, size - incidence) ~ nv + (1 | herd),
family = binomial, data = cbpp)
gm2 <- beta(gm1)
fixef(gm1) - fixef(gm2)
(Intercept) nv
0.5946322 0.1401114
Note however that unlike beta the function returns the updated model not a summary of the model.
As for the second part of your question (whether standardized coefficients are an appropriate way to compare the impact of the categorical predictors to that of the continuous ones): that is a great question, one better suited for stats.stackexchange, and not one I'm certain of the answer to.
Again, thank you so much, Oliver! For anybody who is interested in the answer to the last part of my question (whether standardized coefficients are an appropriate way to compare categorical and continuous predictors), you can find the answer here. The tl;dr is that using standardized regression coefficients is not the best approach for mixed models anyways, let alone one such as mine...

Using weights for repeated cases in R (and specifically gam for binary response)

I've noticed many R models allow a "weights" parameter (e.g. cart, loess, gam, ...). Most of the help pages describe it as "prior weights" for the data, but what does that actually mean?
I have data with many repeated cases and a binary response. I was hoping I could use "weights" to encode how many times each combination of input and response occurs, but this doesn't seem to work. I've also tried making the response the proportion of successes, and the weight the total trials for each combination of covariates, but this doesn't seem to work either (at least for gam). I'm trying to do this for all of the model types listed above, but for starters, how to do this for gam [mgcv package]?
Weights for a binomial response have a natural interpretation: the number of trials corresponding to each observation. If you have n trials of which p are successes, you fit this with
glm(p/n ~ x, family=binomial, weights=n)
The same works with gam in both the gam and mgcv packages.
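For a self-contained illustration with mgcv (made-up aggregated data; the variable names are placeholders):
library(mgcv)
set.seed(1)
agg <- data.frame(x = 1:20, n = sample(5:15, 20, replace = TRUE))
agg$p <- rbinom(20, size = agg$n, prob = plogis(-1 + 0.1 * agg$x))   # successes out of n trials
fit <- gam(p/n ~ s(x), family = binomial, weights = n, data = agg)
summary(fit)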
I also used to think the weights were a convenient way of encoding sample sizes for repeated observations. But the following example shows that this is not the case for a simple linear model. I first define a contingency table with (invented) shoe sizes and heights of people and fit a least-squares regression specifying the frequencies as the weights:
SKdata = matrix(c(20,5,5,5,40,15,3,27,30,2,3,10),ncol=4)
dimnames(SKdata) = list(shoesize=10:12,height=seq(160,190,by=10))
x = as.data.frame(as.table(SKdata), stringsAsFactors=FALSE)
for (i in 1:ncol(x)) x[,i] = as.numeric(x[,i])
fit1 = lm(height ~ shoesize,data=x, weights=Freq)
summary(fit1)
Notice that the coefficient for the slope is non-significant and that the residual error is based on "10 degrees of freedom".
This changes when I convert the contingency table into the "raw" data, meaning one row per observation, with the convenience function expand.dft:
expand.dft <- function(x, na.strings = "NA", as.is = FALSE, dec = ".")
{
  DF <- sapply(1:nrow(x), function(i) x[rep(i, each = x$Freq[i]), ],
               simplify = FALSE)
  DF <- subset(do.call("rbind", DF), select = -Freq)
  for (i in 1:ncol(DF))
  {
    DF[[i]] <- type.convert(as.character(DF[[i]]),
                            na.strings = na.strings,
                            as.is = as.is, dec = dec)
  }
  DF
}
fit2 = lm(height ~ shoesize,data=expand.dft(x))
summary(fit2)
We obtain the identical coefficient, but this time it is highly significant, being based on "163 degrees of freedom".
