building multiple models per group - r

I'd like to group my data, then build two linear models per group, gather the results, and use broom to summarize the model parameters, but I'm having an infinite recursion error that I cant seem to understand. Here's the code:
library(dplyr)
library(tidyr)
library(broom)
mtcars %>%
group_by(am) %>%
dplyr::do(simple_fit = lm(mpg ~ disp, data = .),
complex_fit = lm(mpg ~ disp + hp, data = .)) %>%
ungroup()
gather(model_type, model, -am) %>%
broom::tidy(model)
which results in this error:
Error: evaluation nested too deeply: infinite recursion / options(expressions=)?
There are only 4 models in this example, so I don't understand why I'm hitting such a deep nested loop?

I found a comment on the github that fixed my issue here
Fixed version of the code is as follows:
mtcars %>%
group_by(am) %>%
dplyr::do(simple_fit = lm(mpg~disp, data = .),
complex_fit = lm(mpg ~ disp + hp, data = .)) %>%
ungroup() %>%
gather(model_type, model, -am) %>%
rowwise() %>%
broom::tidy(model)

Related

How to run ggpredict() in a loop following multiple regression models?

The aim is to get the output of the predicted probabilities of several regression models. First i run several regression models using the following code:
library(dplyr)
library(tidyr)
library(broom)
library(ggeffects)
mtcars$cyl=as.factor(mtcars$cyl)
df <- mtcars %>%
group_by(cyl) %>%
do(model1 = tidy(lm(mpg ~ wt + gear + am , data = .), conf.int=TRUE)) %>%
gather(model_name, model, -cyl) %>% ## make it long format
unnest()
I would like to get the predicted probabilities of my predictor weight (wt). If i want to run the code manually for each different cylinder (cyl), it will look as the following:
#Filter by number of cylinders
df=filter(mtcars, cyl==4)
#Save the regression
mod= lm(mpg ~ wt + gear + am, data = df)
#Run the predictive probabilities
pred <- ggpredict(mod, terms = c("wt"))
This will be the code for only the first cylinder cyl==4, then we would have to run the same code for the second (cyl==6) and the third (cyl==8). This is a bit cumbersome. My aim is to automize that as i do for the regression analyses in the first code above. Also, I would like to get these results in the same format as the first code. In other words, they should be in a format that could be plotted afterwards. Can someone help me with that?
Rerun the models with ggpredict() on the inside:
df <- mtcars %>%
group_by(cyl) %>%
do(model1 = ggpredict(lm(mpg ~ wt + gear + am, data= .), terms = c("wt"))) %>%
gather(model_name, model, -cyl) %>% unnest_legacy()
You can then plot wt (in the 'x' column) against 'predicted'. Note that you'll get a warning message on these data.

R two different code chunks to get a p-value but the code evaluates differently and I can't figure out the difference

I'm trying to figure out why these two code chunks give me different p-values for Welch's T-Test. I really just tried to do a tidy version of the base R code and create a table with both statistics. But the tidy version I'm using has a very small p-value and I'm confused as to why.
t.test(mpg ~ vs, data = mtcars) # p-value = 0.0001098
t.test(mpg ~ am, data = mtcars) # p-value = 0.001374
options(scipen = 999)
mtcars %>%
dplyr::select(mpg, vs, am) %>%
pivot_longer(names_to = 'names', values_to = 'values', 2:3) %>%
nest(data = -names) %>%
mutate(
test = map(data, ~ t.test(.x$mpg, .x$values)), # S3 list-col
tidied = map(test, tidy)
) %>%
unnest(tidied) # vs = 0.000000000000000010038009 and am = 0.000000000000000009611758
If you run simply:
t.test(mtcars$mpg, mtcars$vs)
You'll get the same values as in your nested data example.
So the issue is not the nesting - it's that you're performing a different kind of t-test. The formula version is treating the variables vs or am as having two groups (0, 1) and the vectorized version is not.

Grouped linear regression prediction on different grouped by group in R

I'm trying to build models based on specific groups in a dataset and use the models generated to predict fit on a different dataset by following the group restrictions. In other words, using the example below, models built using subset: cyl==4 of original data should be used only to predict subset: cyl==4 of new dataset (data1). Anyone can help with this interesting problem?
I tried to used data1%>% group_by(cyl) to specify the new data but that didn't help
Thank you
library(broom)
library(dplyr)
library(purrr)
data1 <- head(mtcars,20)
x<-mtcars %>%
group_by(cyl) %>%
summarise(fit = list(lm(wt ~ mpg)),
data = list(cur_data())) %>%
mutate(col = map(fit, augment, newdata = data1%>% group_by(cyl)))```
Here is a quick way to do this
library(dplyr)
models = mtcars %>% group_by(cyl) %>% do(model = lm(wt ~ mpg, data = .))
Then access the individual models with
library(broom)
tidy(models$model[[1]])
Another way to do the same -
models <- mtcars %>%
nest_by(cyl) %>%
mutate(mod = list(lm(mpg ~ disp, data = data)))

Trying to unnest broom::augment data, but R "can't cast"

I can't reproduce the data here, but I'm hoping I'm making an obvious mistake. I am trying to get residuals from all the of the models I fit with purrr::map. My code looks like this:
df %>%
group_by(group) %>%
nest() %>%
mutate(model = map(data, fit_mod),
model_data = map(model, broom::augment)) %>%
ungroup()%>%
unnest(c(data, model_data))
I get an error related to the title of one of the coefficients in my model:
Error: Can't cast model_data$poly.Actual_Population..2..raw...TRUE.
to model_data$poly.Actual_Population..2..raw...TRUE. .
How can I just pull those residuals out of there? I want to check my models.

run multiple model and save model comparison results in dataframe in r

I want to run lm models and save model comparison result and extract p-values. I would like to save all the info in a dataframe.
Using diamonds dataset as an example:
diamonds %>%
group_by(cut) %>%
do(model1 = lm(price~carat, data=.),
model2 = lm(price~carat+depth, data=.)) %>%
mutate(anova = anova(model2,model1)) %>%
mutate(pval= anova$'Pr(>F'[2])
I got error message below:
Error in mutate_impl(.data, dots) :
Column `anova` must be length 1 (the group size), not 6
My question is:
Why I got the error message and how to save anova result in the dataframe?
how to make the whole process work if lm or anova do not work on some subsets? something like try..catch..
My real data is more complicated then this. Just use diamonds and linear model to illustrate the idea.
Thanks a lot.
This is a really good application of the tidyr::nest() function in conjunction with purrr and broom. What you do is:
- Group the data frame
- Apply a model with mutate(mod = map(data, model)
- summarize the model using broom::tidy()
- extract the relevant statistics.
For more on this here's a great talk by Hadley on the subject: https://www.youtube.com/watch?v=rz3_FDVt9eg
In your case I think you can do something like this:
library(tidyverse)
library(broom)
diamonds %>%
group_by(cut) %>%
nest() %>%
mutate(
model1 = map(data, ~lm(price~carat, data=.)),
model2 = map(data, ~lm(price~carat+depth, data=.))
) %>%
mutate(anova = map2(model1, model2, ~anova(.x,.y))) %>%
mutate(tidy_anova = map(anova, broom::tidy)) %>%
mutate(p_val = map_dbl(tidy_anova, ~.$p.value[2])) %>%
select(p_val)

Resources