Naming conventions of predictor variables - R

Specific Examples:
log1 <- glm(Outcome ~ Predictor1 + Predictor2,
            family = binomial(link = "logit"), data = data)
log2 <- glm(data$Outcome ~ data$Predictor1 + data$Predictor2,
            family = binomial(link = "logit"))
These produce the same models and their summaries are identical.
Then why, when using these models to predict an outcome from test data, do the values differ?
Example:
predict(log1, type = "response", newdata = test_dat) ==
  predict(log2, type = "response", newdata = test_dat)
# FALSE
I am not as familiar with R as I would like, but I can't seem to explain the difference. Help?

To compare two objects use identical(log1, log2); however, the problem is that the variable names are part of the model objects, so if the names differ the objects cannot be identical even if all the underlying numbers are the same.
For example, note how Time and BOD$Time are part of fm1 and fm2:
fm1 <- lm(demand ~ Time, BOD)
fm2 <- lm(BOD$demand ~ BOD$Time)
fm1[[1]]
## (Intercept) Time
## 8.521429 1.721429
fm2[[1]]
## (Intercept) BOD$Time
## 8.521429 1.721429
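This also explains the prediction difference in the question: because fm2's formula contains BOD$Time rather than Time, predict() cannot find that variable in newdata and silently falls back to the data used for fitting. A minimal sketch (the data frame nd is made up for illustration):
nd <- data.frame(Time = c(1, 3, 5))
predict(fm1, newdata = nd)  # uses the new Time values, returns 3 predictions
predict(fm2, newdata = nd)  # ignores nd, returns the 6 fitted values for BOD
# The underlying numbers still agree once the coefficient names are dropped:
all.equal(unname(coef(fm1)), unname(coef(fm2)))  # TRUE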

Related

How to loop over columns to evaluate different fixed effects in consecutive lme4 mixed models and extract the coefficients and P values?

I am new to R and am trying to loop a mixed model across 90 columns in a dataset.
My dataset looks like the one below, but with 90 predictors instead of 7, each of which I need to evaluate as a fixed effect in consecutive models.
I then need to store the model output (coefficients and P values) so that I can finally construct a figure summarizing the effect sizes of each predictor. I am aware of the discussion around P value estimates from lme4 mixed models.
For example:
set.seed(101)
library(dplyr)  # for tibble() and %>%
mydata <- tibble(id = rep(1:32, times = 25),
                 time = sample(1:800),
                 experiment = rep(1:4, times = 200),
                 Y = sample(1:800),
                 predictor_1 = runif(800),
                 predictor_2 = rnorm(800),
                 predictor_3 = sample(1:800),
                 predictor_4 = sample(1:800),
                 predictor_5 = seq(1:800),
                 predictor_6 = sample(1:800),
                 predictor_7 = runif(800)) %>%
  arrange(id, time)
The model to iterate across the N predictors is:
library(lme4)
library(lmerTest)  # to obtain p values for the fixed effects
mixed.model <- lmer(Y ~ predictor_1 + time + (1|id) + (1|experiment), data = mydata)
summary(mixed.model)
My coding skills are far from being able to set up a loop that repeats the model across the N predictors in my dataset and stores the coefficients and P values in a data frame.
I have been able to iterate across all the predictors fitting linear models instead of mixed models using lapply, but I have failed to apply this strategy to mixed models.
varlist <- names(mydata)[5:11]
lm_models <- lapply(varlist, function(x) {
  lm(substitute(Y ~ i, list(i = as.name(x))), data = mydata)
})
One option is to update the formula of a restricted model (without the predictor) in an lapply loop over the predictors, then summarize the resulting list and subset the coefficient matrix with a function wrapped in Vectorize().
library(lmerTest)
mixed.model <- lmer(Y ~ time + (1|id) + (1|experiment), data = mydata)
preds <- grep('pred', names(mydata), value=TRUE)
# refit the restricted model once for each predictor
fits <- lapply(preds, \(x) update(mixed.model, paste('. ~ . + ', x)))
# row 3 of the coefficient matrix is the added predictor; columns 1 and 5 hold
# the estimate and the lmerTest p value
extract_coef_p <- Vectorize(\(x) x |> summary() |> coef() |> {\(.) .[3, c(1, 5)]}())
res <- `rownames<-`(t(extract_coef_p(fits)), preds)
res
# Estimate Pr(>|t|)
# predictor_1 -7.177579138 0.8002737
# predictor_2 -5.010342111 0.5377551
# predictor_3 -0.013030513 0.7126500
# predictor_4 -0.041702039 0.2383835
# predictor_5 -0.001437124 0.9676346
# predictor_6 0.005259293 0.8818644
# predictor_7 31.304496255 0.2511275
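To get from res to the figure mentioned in the question, one possible follow-up (not part of the answer above; ggplot2 is assumed to be available) is to convert the matrix to a data frame and plot the estimates:
library(ggplot2)
res_df <- data.frame(predictor = rownames(res),
                     estimate  = res[, "Estimate"],
                     p_value   = res[, "Pr(>|t|)"])
ggplot(res_df, aes(x = estimate, y = predictor)) +
  geom_point() +
  labs(x = "Fixed-effect estimate", y = NULL)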

R: modeling on residuals

I have heard people talk about "modeling on the residuals" when they want to estimate some effect after an a priori model has been fitted. For example, if two variables, var_1 and var_2, are correlated, they first fit a model with var_1 and then model the effect of var_2 afterwards. My problem is that I've never seen this done in practice.
I'm interested in the following:
1. If I'm using a glm, how do I account for the link function used?
2. What distribution do I choose when running a second glm with var_2 as the explanatory variable? I assume this is related to 1.
3. Is this at all related to using the first model's prediction as an offset in the second model?
My attempt:
library(data.table)
dt <- data.table(mtcars)  # I have a hypothesis that `mpg` is a function of both `cyl` and `wt`
dt[, cyl := as.factor(cyl)]
model <- stats::glm(mpg ~ cyl, family = Gamma(link = "log"), data = dt)  # I want to model `cyl` first
dt[, pred := stats::predict(model, type = "response", newdata = dt)]
dt[, res := mpg - pred]
# will this approach work?
model2_1 <- stats::glm(mpg ~ wt + offset(pred), family = Gamma(link = "log"), data = dt)
dt[, pred21 := stats::predict(model2_1, type = "response", newdata = dt)]
# or will this approach work?
model2_2 <- stats::glm(res ~ wt, family = gaussian(), data = dt)
dt[, pred22 := stats::predict(model2_2, type = "response", newdata = dt)]
My first suggested approach has convergence issues, but this is how my silly brain would approach this problem. Thanks for any help!
In a sense, an ANCOVA is 'modeling on the residuals'. The ANCOVA model is y_ij = grand_mean + treatment_i + b * (covariate_ij - covariate_mean_i) + error_ij for observation j in treatment group i. The term (covariate_ij - covariate_mean_i) can be seen as the residual of a model with the covariate as dependent variable (DV) and treatment as independent variable (IV).
The following regression is equivalent to this ANCOVA:
lm(y ~ treatment * scale(covariate, scale = FALSE))
Applied to the data, this would look like:
lm(mpg ~ factor(cyl) * scale(wt, scale = FALSE), data = mtcars)
And it can be turned into a glm similar to the one you use in your example:
glm(mpg ~ factor(cyl) * scale(wt, scale = FALSE),
    family = Gamma(link = "log"),
    data = mtcars)
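As an aside on question 3 in the post (my reading, not part of the answer above): an offset in glm is added to the linear predictor, so with a log link it has to be supplied on the link scale, not the response scale. A minimal sketch, reusing the dt and model objects from the attempt above:
# predictions on the link (log) scale rather than the response scale
dt[, pred_link := stats::predict(model, type = "link", newdata = dt)]
# second model: effect of wt on top of the cyl-only fit, held fixed via the offset
model2_offset <- stats::glm(mpg ~ wt + offset(pred_link),
                            family = Gamma(link = "log"), data = dt)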

Simple Way to Combine Predictions from Multiple Models for Subset Data in R

I would like to build separate models for the different segments of my data. I have built the models like so:
log1 <- glm(y ~ ., family = "binomial", data = train, subset = x1==0)
log2 <- glm(y ~ ., family = "binomial", data = train, subset = x1==1 & x2<10)
log3 <- glm(y ~ ., family = "binomial", data = train, subset = x1==1 & x2>=10)
If I run the predictions on the training data, R remembers the subsets and the prediction vectors have the length of the respective subset.
However, if I run the predictions on the testing data, the prediction vectors have the length of the whole dataset, not that of the subsets.
My question is whether there is a simpler way to achieve what I currently do by subsetting the testing data, running the predictions on each subset, concatenating the predictions, rbinding the subset data, and appending the concatenated predictions, like this:
T1 <- subset(Test, x1==0)
T2 <- subset(Test, x1==1 & x2<10)
T3 <- subset(Test, x1==1 & x2>=10)
log1pred <- predict(log1, newdata = T1, type = "response")
log2pred <- predict(log2, newdata = T2, type = "response")
log3pred <- predict(log3, newdata = T3, type = "response")
allpred <- c(log1pred, log2pred, log3pred)
TAll <- rbind(T1, T2, T3)
TAll$allpred <- as.data.frame(allpred)
I'd like to think I am being stupid and there is an easier way to accomplish this - many models on small subsets of the data. How do I combine them to get the predictions on the full testing data?
First, here's some sample data:
set.seed(15)
train <- data.frame(x1 = sample(0:1, 100, replace = TRUE),
                    x2 = rpois(100, 10),
                    y  = sample(0:1, 100, replace = TRUE))
test <- data.frame(x1 = sample(0:1, 10, replace = TRUE),
                   x2 = rpois(10, 10))
Now we can fit the models. Here I place them in a list to make it easier to keep them together, and I also remove x1 from each model since it is constant within each subset:
fits <- list(
  glm(y ~ . - x1, family = "binomial", data = train, subset = x1 == 0),
  glm(y ~ . - x1, family = "binomial", data = train, subset = x1 == 1 & x2 < 10),
  glm(y ~ . - x1, family = "binomial", data = train, subset = x1 == 1 & x2 >= 10)
)
Now, for the testing data, I create an indicator that specifies which group each observation falls into. I do this by looking at the subset= parameter of each of the calls and evaluating those conditions in the test data.
whichsubset <- as.vector(sapply(fits, function(x) {
  subsetparam <- x$call$subset
  eval(subsetparam, test)
}) %*% matrix(1:length(fits), ncol = 1))
You'll want to make sure your groups are mutually exclusive because this code does not check for that. Then you can use this grouping variable with a split/unsplit strategy for making your predictions:
unsplit(
  Map(function(a, b) predict(a, b),
      fits, split(test, whichsubset)),
  whichsubset
)
An even easier strategy would have been to create the segregating factor in the first place; this would make the model fitting easier as well.
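A rough sketch of that simpler alternative (my own illustration, not from the answer; the helper make_group and the column grp are made up):
make_group <- function(d) {
  factor(ifelse(d$x1 == 0, "g1", ifelse(d$x2 < 10, "g2", "g3")),
         levels = c("g1", "g2", "g3"))
}
train$grp <- make_group(train)
test$grp  <- make_group(test)
# one model per group; x1 is dropped because it is constant within each group
fits2 <- lapply(split(train, train$grp),
                function(d) glm(y ~ x2, family = "binomial", data = d))
# predict each group with its own model, then put the results back in test order
test$pred <- unsplit(Map(function(f, d) predict(f, d, type = "response"),
                         fits2, split(test, test$grp)),
                     test$grp)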

Function for Logistic Regression Training Set

I am trying to create a function to test a logistic regression model developed on a training set.
For example
train <- filter(y, folds != i)
test <- filter(y, folds == i)
I want to be able to use the formula for different data sets.
For example, if I were to take y to be a response variable such as "low" in the birthwt data set, and x to be explanatory variables such as "age" and "race", how would I pass these arguments to the glm.train formula without having to retype the function for each data set?
glm.train <- glm(y ~ x, family = binomial, data = train)
You can use reformulate to create a formula based on strings:
x <- c("age", "race")
y <- "low"
form <- reformulate(x, response = y)
# low ~ age + race
Use this formula for glm:
glm.train <- glm(form, family = binomial, data = train)
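For instance, with the MASS birthwt data mentioned in the question (a sketch; here the model is fitted on the full data set rather than a training fold):
library(MASS)
form <- reformulate(c("age", "race"), response = "low")
glm.train <- glm(form, family = binomial, data = birthwt)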

Setting a formula for GLM as a sum of columns in R

I am trying to set the formula for glm so that columns 1 to 99 of train are the predictors and column 100 is the outcome, something like:
model <- glm(train$100 ~ train$1:99, data = train, family = "binomial")
I can't figure out the right way to do it in R...
If you need outcome ~ var1 + var2 + ... + varN, then try this:
# Name of the outcome column
f1 <- colnames(train)[100]
# Other columns separated by "+"
f2 <- paste(colnames(train)[1:99], collapse = "+")
# glm
model <- glm(formula = as.formula(paste(f1, f2, sep = "~")),
             data = train,
             family = "binomial")
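As an aside, reformulate() (used in the previous question) builds the same formula a little more compactly; a sketch assuming the same train layout:
model <- glm(reformulate(colnames(train)[1:99], response = colnames(train)[100]),
             data = train, family = "binomial")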
The simplest way, assuming that you want to use all but column 100 as predictor variables, is
model <- glm(v100 ~ ., data = train, family = "binomial")
where v100 is the name of the 100th column (the name can't be 100 unless you have done something advanced/sneaky to subvert R's rules about data frame column names ...)
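If the columns really were named "1" through "100" (for example, a data frame built with check.names = FALSE), one workaround, as far as I know, is to quote the non-syntactic name with backticks in the formula:
# assumption: train genuinely has a column named "100"
model <- glm(`100` ~ ., data = train, family = "binomial")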
