Write a function to list all possible combinations of models - r

I'm attempting to write a function to run all possible regression models for variables in a dataset. I was able to get it to run for each variable; this is what I have so far.
library(tidyverse)
library(broom)
data("mtcars")
model1 <- function(DATA) {
  DATA %>%
    map(~ lm(mpg ~ .x, data = DATA)) %>%
    map(summary) %>%
    map_dbl("adj.r.squared") %>%
    tidy %>%
    rename(adj.r.squared = x)
}
model1(mtcars)
I am new to R and to writing functions, so I am sure there are some issues with it. I want a tibble of the adjusted R-squared values for all possible models. How do I write a function that will do the same thing for two, three, or more variables?

I am not aware of any packages that allow one to automate this. So, let's try a brute force approach. The idea is to generate all possible combinations by hand and iterate over them.
vars <- names(mtcars)[-1]
models <- list()
for (i in seq_along(vars)){  # all subset sizes, not just the first five
  vc <- combn(vars, i)
  for (j in 1:ncol(vc)){
    model <- as.formula(paste0("mpg ~ ", paste0(vc[, j], collapse = " + ")))
    models <- c(models, model)
  }
}
You can then use these formulas to run the linear models.
lapply(models, function(x) lm(x, data = mtcars))
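To get the tibble of adjusted R-squared values that the question asks for, one possible follow-up is sketched below (the all_models_adjr2 helper is my name, not from the original post):

library(tidyverse)

# Fit each formula and collect the adjusted R-squared values in a tibble.
all_models_adjr2 <- function(formulas, data) {
  tibble(
    formula = map_chr(formulas, ~ paste(deparse(.x), collapse = " ")),
    adj.r.squared = map_dbl(formulas, ~ summary(lm(.x, data = data))$adj.r.squared)
  )
}

all_models_adjr2(models, mtcars)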

Related

How do I use predictions from another model as feature in a tidymodels recipe?

I want to train a model via tidymodels using predictions from another model as a feature. Specifically, it's a KNN model where I want to use predictions from a random forest model as a feature.
I started implementing a (hacky) solution using step_mutate; here it is:
library(dplyr)
library(tidymodels)
library(purrr)
library(data.table)

df <- data.table(
  y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100)
)

pred_rf <- function(...) {
  # Very hacky function that creates random forest predictions
  nms <- purrr::map_chr(rlang::enexprs(...), as.character)
  l <- list(...)
  dat <- setDT(l)
  outcome <- names(dat)[1]
  preds <- names(dat)[-1]
  rec <- recipe(dat) %>%
    update_role(!!outcome, new_role = "outcome") %>%
    update_role(!!preds, new_role = "predictor")
  model <- rand_forest(mode = "regression")
  wf <- workflow() %>%
    add_recipe(rec) %>%
    add_model(model)
  fitted_model <- fit(wf, dat)
  predictions <- predict(fitted_model, dat)$.pred
  stopifnot(length(predictions) == nrow(dat))
  stopifnot(sum(is.na(predictions)) == 0)
  return(predictions)
}

rec <- recipe(y ~ ., df) %>%
  step_mutate(y_pred = pred_rf(y, x1, x2)) %>%
  prep()

bake(rec, new_data = NULL) # Desired output would be a design matrix like this
However, I realised that this would cause data leakage when used for tuning. Is this possible to do without data leakage, or would I need to create a custom step? It would be very similar to the step_impute_* functions, but I couldn't find anything.
Thanks
The comment about data leakage is spot on. That is a huge concern (especially for a random forest model). This isn't an issue for imputation, since the outcome variables for the original and imputation models are different.
We suggest re-framing the problem by including the original and KNN models (and others) in a stacking ensemble. That way, your other models can affect the outcome but are not inside another model. That may not be what you want, but I don't see any way to get there without significant overfitting.
As a side note, step_mutate() wouldn't work since the model doesn't persist. You would have to emulate the imputation steps to make sure that new samples can be processed with the recipe. The PLS and class distance steps are also good examples to emulate.
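For reference, here is a minimal sketch of the suggested stacking approach using the stacks package; the model specifications and resampling scheme are my assumptions, not part of the original answer, and the ranger and kknn engines must be installed:

library(tidymodels)
library(stacks)

set.seed(1)
df <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
folds <- vfold_cv(df, v = 5)
ctrl <- control_stack_resamples()  # saves predictions/workflows for stacking

# Candidate models, each resampled on the same folds.
rf_wf <- workflow() %>%
  add_formula(y ~ .) %>%
  add_model(rand_forest(mode = "regression"))      # default engine: ranger
knn_wf <- workflow() %>%
  add_formula(y ~ .) %>%
  add_model(nearest_neighbor(mode = "regression")) # default engine: kknn

rf_res <- fit_resamples(rf_wf, resamples = folds, control = ctrl)
knn_res <- fit_resamples(knn_wf, resamples = folds, control = ctrl)

# Blend on out-of-fold predictions, which avoids the leakage that the
# step_mutate() approach would introduce.
ens <- stacks() %>%
  add_candidates(rf_res) %>%
  add_candidates(knn_res) %>%
  blend_predictions() %>%
  fit_members()

predict(ens, new_data = df)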

Storing model coefficients in a loop in R

This seems like a very basic problem, but I have not been able to find a solution. I essentially wish to run a linear regression in a for loop and store the model coefficients (and standard errors, if possible) for each iteration in a CSV file.
For reference, I am running Fama-MacBeth regressions on macroeconomic "shocks" (the residuals of macroeconomic factors regressed on their lagged values).
My code for the loop is as follows:
library(sandwich)  # vcovHC() comes from the sandwich package

for (i in 7:69){
  model <- lm(data = data, data[[i]] ~ TM2R + IPR + InfR + UnR + OilR)
  # Model coefficients
  print(model$coefficients)
  # Robust covariance matrix for the standard errors
  model$vcov <- vcovHC(model, type = "HC1")
  print(model$vcov)
}
You can use the broom package to transform the output from lm() into a data frame, then append the robust standard errors from vcovHC() to it.
Finally, you export each result as a CSV file.
library(broom)
library(sandwich)

for (i in 7:69){
  model_name <- paste0("model_", i, ".csv")
  model_i <- lm(data = data, data[[i]] ~ TM2R + IPR + InfR + UnR + OilR)
  tidy_model <- tidy(model_i)
  # The robust (HC1) standard errors are the square roots of the diagonal
  # of the heteroskedasticity-consistent covariance matrix.
  tidy_model$robust_se <- sqrt(diag(vcovHC(model_i, type = "HC1")))
  write.csv(tidy_model, file = model_name, row.names = FALSE)
}
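If a single file is preferred over 63 separate ones, the tidied results can be collected first; a sketch, assuming the same data object as in the question (the dependent column is my addition, recording which variable was the outcome):

library(dplyr)
library(purrr)
library(broom)
library(sandwich)

all_models <- map_dfr(7:69, function(i) {
  model_i <- lm(data[[i]] ~ TM2R + IPR + InfR + UnR + OilR, data = data)
  out <- tidy(model_i)
  out$robust_se <- sqrt(diag(vcovHC(model_i, type = "HC1")))
  out$dependent <- names(data)[i]  # which column was the outcome
  out
})
write.csv(all_models, "all_models.csv", row.names = FALSE)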

multiple imputation, lmer, and pooling ggeffects objects

I computed linear mixed effects models using lme4::lmer() on data that I multiply imputed using the mice package. On these lmer objects, I want to apply ggeffects::ggeffect() to get marginal effects that I can then plot for mean, +1sd and -1sd.
The pool_predictions function seems perfectly suited and does a great job for lm objects; however, for lmer objects the ggeffect() function does not work. ggpredict() for some reason works, but I want to get marginal, not conditional effects.
Here's a minimal reproducible example that I adapted from the pool_predictions() reference (the mixed model doesn't make sense, it's just to create an example):
if (!require("pacman")) install.packages("pacman")
pacman::p_load(mice, stats, lme4, ggeffects)

data("nhanes2")

# First, the working example from the pool_predictions() reference, using an lm object and ggpredict():
imp <- mice(nhanes2, printFlag = FALSE)
predictions1 <- lapply(1:5, function(i) {
  m1 <- lm(bmi ~ age + hyp + chl, data = complete(imp, action = i))
  ggpredict(m1, "age")
})
pool_predictions(predictions1)

# Now the same example, but using ggeffect() on the lm object, which also works:
predictions2 <- lapply(1:5, function(i) {
  m2 <- lm(bmi ~ age + hyp + chl, data = complete(imp, action = i))
  ggeffect(m2, "age")
})
pool_predictions(predictions2)

# It also seems to work for lmer objects, at least when using ggpredict():
predictions3 <- lapply(1:5, function(i) {
  m3 <- lmer(bmi ~ age + chl + (1|hyp), data = complete(imp, action = i))
  ggpredict(m3, "age")
})
pool_predictions(predictions3)

# But when I use ggeffect() instead of ggpredict(), this doesn't work anymore for lmer objects.
predictions4 <- lapply(1:5, function(i) {
  m4 <- lmer(bmi ~ age + chl + (1|hyp), data = complete(imp, action = i))
  ggeffect(m4, "age")
})
pool_predictions(predictions4)
Does anyone have an idea why this happens, or any tips on how I can get the pooled marginal effects for my lmer objects?
Thanks a lot!
Antje
I think this may be due to the way data is retrieved from the environment, which fails for ggeffect() (which is based on the effects package). You could try ggemmeans() instead, which should give you the same results as ggeffect() does.
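A minimal sketch of that workaround, assuming pool_predictions() accepts ggemmeans() output the same way it accepts ggpredict() output:

# Same lmer example as above, but with ggemmeans(), which computes
# marginal means via the emmeans package.
predictions5 <- lapply(1:5, function(i) {
  m5 <- lmer(bmi ~ age + chl + (1|hyp), data = complete(imp, action = i))
  ggemmeans(m5, "age")
})
pool_predictions(predictions5)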

How to extract and modify individual p.table components from GAM fits in R

I have used a for loop to run a series of GAMs in R that regress a series of dependent variables on the same set of independent variables. I want to extract the p.table values from each model, but when I print the p.table objects from my list of model summaries, the p-values are absurdly long (~100 digits), and I cannot figure out how to apply a function to just that component of the p.table output while still printing the whole output.
Here is an example with mtcars. These model results are obviously meaningless, and in this case the p-values print fine; in my data, however, the p-values are far too long, and I want to truncate them in the printed output using, e.g., format.pval.
data(mtcars)
library(mgcv)
y_vars <- c("qsec", "wt", "hp")
models <- list()
for (i in y_vars){
  models[[i]] <- gam(as.formula(paste(i, "~ cyl + s(drat) + am + gear + carb")),
                     method = "REML", data = mtcars)
}
models_summ <- lapply(models, summary)
lapply(models_summ, '[[', 'p.table')
I ended up assigning the output to a data frame and operating with it that way:
df <- data.frame(lapply(models_summ, '[[', 'p.table'))
If anyone has more elegant solutions, I would love to see them.
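One possible alternative (a sketch; the print_ptable helper is my name): keep the p.table matrices intact and reformat only the p-value column at print time with format.pval, as the question suggests.

# Print a p.table with truncated p-values; the p-value column is named
# "Pr(>|t|)" or "Pr(>|z|)" depending on the model family.
print_ptable <- function(pt, digits = 3) {
  out <- as.data.frame(pt)
  pcol <- grep("^Pr\\(", names(out))
  out[[pcol]] <- format.pval(out[[pcol]], digits = digits)
  print(out)
  invisible(pt)
}

lapply(models_summ, function(s) print_ptable(s$p.table))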

extract log rank (score) test result with p-value for Coxph Model

I have 100 replicates of a coxph model fitted in a loop. I am trying to extract the log-rank (score) test result with its p-value for each replicate into a data frame or list. I am using the following, but it gives me only the log-rank score, not the p-value. Any help will be much appreciated.
I can share the dataset, but I am not sure how to attach it here.
thanks,
Krina
Repl_List <- unique(dat3$Repl)

doLogRank = function(sel_name) {
  dum <- dat3[dat3$Repl == sel_name, ]
  reg <- with(dum, coxph(Surv(TIME_day, STATUS) ~ Treatment, ties = "breslow"))
  LogRank <- with(reg, reg$score)
}

LogRank <- t(as.data.frame(lapply(Repl_List, doLogRank)))
Here is a mock example that I took from the help page of the coxph function. I just replicated the dataset 100 times to recreate your scenario. I highly recommend starting to use the tidyverse packages for this kind of work; broom is a great addition along with dplyr and tidyr.
library(survival)
library(tidyverse)
library(broom)
test <- data.frame(time = c(4, 3, 1, 1, 2, 2, 3),
                   status = c(1, 1, 1, 0, 1, 1, 0),
                   x = c(0, 2, 1, 1, 1, 0, 0),
                   sex = c(0, 0, 0, 0, 1, 1, 1))
Below I am replicating the dataset 100 times using the replicate function.
r <- replicate(test, n = 100, simplify = FALSE) %>%
  bind_rows %>%
  mutate(rep = rep(seq(1, 100, 1), each = 7))
I set up the cox model as a small function that I can then apply to each replicate of the dataframe.
cxph_mod <- function(df) {
  coxph(Surv(time, status) ~ x + strata(sex), df)
}
Below is the step-by-step process of fitting the model and extracting the values:
tidyr::nest the dataframe
purrr::map the model onto each nested dataframe
(nest is a function in library(tidyr); map is a function similar to lapply, from library(purrr))
nested <- r %>%
  group_by(rep) %>%
  nest %>%
  mutate(model = data %>% map(cxph_mod))
Look at the first rep to see the coxph output. You will see the model object stored in the cells of the dataframe, allowing easier access.
nested %>% filter(rep==1)
With each model object in place, now use broom to pull the parameter estimates from each model into the nested dataset.
nested <- nested %>%
  mutate(
    # conf.int = TRUE so that conf.low/conf.high are available below
    ests = model %>% map(broom::tidy, conf.int = TRUE)
  )
tidyr::unnest to view the estimates for each resampled dataset:
ests <- nested %>%
  unnest(ests) %>%
  dplyr::select(rep, term, estimate:conf.high)
In this case, since I am repeating the same dataset 100 times, the p-value will be the same; in your case you will have 100 different datasets and hence 100 different p-values.
ggplot(data = ests, aes(y = p.value, x = rep)) + geom_point()
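If the goal is specifically the score (log-rank) test from the question rather than the per-coefficient Wald p-values, broom::glance() exposes it; a sketch, assuming a broom version whose glance() output for coxph includes the statistic.sc and p.value.sc columns:

score_tests <- nested %>%
  mutate(gl = map(model, broom::glance)) %>%
  unnest(gl) %>%
  dplyr::select(rep, statistic.sc, p.value.sc)
score_tests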
Vijay
