Grouped linear regression prediction on different grouped by group in R - r

I'm trying to build models based on specific groups in a dataset and use the models generated to predict fit on a different dataset by following the group restrictions. In other words, using the example below, models built using subset: cyl==4 of original data should be used only to predict subset: cyl==4 of new dataset (data1). Anyone can help with this interesting problem?
I tried to used data1%>% group_by(cyl) to specify the new data but that didn't help
Thank you
library(broom)
library(dplyr)
library(purrr)
data1 <- head(mtcars,20)
x<-mtcars %>%
group_by(cyl) %>%
summarise(fit = list(lm(wt ~ mpg)),
data = list(cur_data())) %>%
mutate(col = map(fit, augment, newdata = data1%>% group_by(cyl)))```

Here is a quick way to do this
library(dplyr)
models = mtcars %>% group_by(cyl) %>% do(model = lm(wt ~ mpg, data = .))
Then access the individual models with
library(broom)
tidy(models$model[[1]])
Another way to do the same -
models <- mtcars %>%
nest_by(cyl) %>%
mutate(mod = list(lm(mpg ~ disp, data = data)))

Related

Multiple fixest_multi models and shape parameter - modelsummary package

Again, thanks to Laurent for answering questions and supporting the modelsummarypackage.
library(tidyverse)
library(fixest)
library(modelsummary)
fit<-mtcars %>% feols(c(mpg,hp )~1)
fit_1 <- mtcars %>% feols(c(mpg,hp,wt )~1)
fit_2 <- mtcars %>% feols(c(mpg,hp,gear, wt )~1)
modelsummary(c(fit, fit_1, fit_2), shape=model + statistic ~ term, output="flextable")
We obtain the long column of estimates from all 3 models (which are simple averages) as in:
So is there a way to rearrange the columns and the rows either using internal modelsummary functions or external work to the following format:
The biggest problem is moving the terms around so that they are aligned on the same line (note that the order of terms between fit_1 and fit_2 is changed) and the rest is filled with NA. Would really appreciate any help! It's a part of a larger problem I've been trying to solve unsuccessfully for the last 3 weeks.
One option is to output to a data frame and reshape manually:
library(tidyverse)
library(fixest)
library(modelsummary)
library(flextable)
fit<-mtcars %>% feols(c(mpg,hp )~1)
fit_1 <- mtcars %>% feols(c(mpg,hp,wt )~1)
fit_2 <- mtcars %>% feols(c(mpg,hp,gear, wt )~1)
models <- c(fit, fit_1, fit_2)
modelsummary(
models,
output = "dataframe",
shape = model + statistic ~ term) |>
mutate(
fit = c(rep(1, 4), rep(2, 6), rep(3, 8)),
model = trimws(model)) |>
pivot_wider(names_from = "fit", values_from = "(Intercept)") |>
select(-statistic, -part, ` ` = model) |>
flextable()
This is a very customized shape and modeling context, so I can't currently think of a way to achieve that purely using internal modelsummary functions arguments.

R two different code chunks to get a p-value but the code evaluates differently and I can't figure out the difference

I'm trying to figure out why these two code chunks give me different p-values for Welch's T-Test. I really just tried to do a tidy version of the base R code and create a table with both statistics. But the tidy version I'm using has a very small p-value and I'm confused as to why.
t.test(mpg ~ vs, data = mtcars) # p-value = 0.0001098
t.test(mpg ~ am, data = mtcars) # p-value = 0.001374
options(scipen = 999)
mtcars %>%
dplyr::select(mpg, vs, am) %>%
pivot_longer(names_to = 'names', values_to = 'values', 2:3) %>%
nest(data = -names) %>%
mutate(
test = map(data, ~ t.test(.x$mpg, .x$values)), # S3 list-col
tidied = map(test, tidy)
) %>%
unnest(tidied) # vs = 0.000000000000000010038009 and am = 0.000000000000000009611758
If you run simply:
t.test(mtcars$mpg, mtcars$vs)
You'll get the same values as in your nested data example.
So the issue is not the nesting - it's that you're performing a different kind of t-test. The formula version is treating the variables vs or am as having two groups (0, 1) and the vectorized version is not.

Object '.' not found while piping with dplyr

I am trying to conduct a survival curve using the survival package. The MWE code is as follows:
df %>%
filter(fac <= "Limit") %>%
survfit(Surv(tte, !is.na(event)) ~ fac, data = .) %>%
ggsurvplot(fit = .)
I get the error Error in eval(fit$call$data) : object '.' not found
When I try to break this down further by:
survfit <- df %>%
filter(fac <= "Limit") %>%
survfit(Surv(tte, !is.na(event)) ~ fac, data = .)
ggsurvplot(fit = survfit)
I get an identical error. Is anyone able to figure out how to pipe from my dataframe all the way through a survival curve? The reason I would like to do this is to streamline the filtering of my dataframe in order to produce a multitude of different survival curves without having to create many subsetted dataframes.
Apparently, ggsurvplot expects an object of class "survfit" as its first argument but also needs the data set as an argument.
The example below is based on the first example of function
survfit.formula {survival}.
library(dplyr)
library(survival)
library(survminer)
aml %>%
survfit(Surv(time, status) ~ x, data = .) %>%
ggsurvplot(data = aml)
In the question's case this would become
df %>%
filter(fac <= "Limit") %>%
survfit(Surv(tte, !is.na(event)) ~ fac, data = .) %>%
ggsurvplot(data = filter(df, fac <= "Limit"))

run multiple model and save model comparison results in dataframe in r

I want to run lm models and save model comparison result and extract p-values. I would like to save all the info in a dataframe.
Using diamonds dataset as an example:
diamonds %>%
group_by(cut) %>%
do(model1 = lm(price~carat, data=.),
model2 = lm(price~carat+depth, data=.)) %>%
mutate(anova = anova(model2,model1)) %>%
mutate(pval= anova$'Pr(>F'[2])
I got error message below:
Error in mutate_impl(.data, dots) :
Column `anova` must be length 1 (the group size), not 6
My question is:
Why I got the error message and how to save anova result in the dataframe?
how to make the whole process work if lm or anova do not work on some subsets? something like try..catch..
My real data is more complicated then this. Just use diamonds and linear model to illustrate the idea.
Thanks a lot.
This is a really good application of the tidyr::nest() function in conjunction with purrr and broom. What you do is:
- Group the data frame
- Apply a model with mutate(mod = map(data, model)
- summarize the model using broom::tidy()
- extract the relevant statistics.
For more on this here's a great talk by Hadley on the subject: https://www.youtube.com/watch?v=rz3_FDVt9eg
In your case I think you can do something like this:
library(tidyverse)
library(broom)
diamonds %>%
group_by(cut) %>%
nest() %>%
mutate(
model1 = map(data, ~lm(price~carat, data=.)),
model2 = map(data, ~lm(price~carat+depth, data=.))
) %>%
mutate(anova = map2(model1, model2, ~anova(.x,.y))) %>%
mutate(tidy_anova = map(anova, broom::tidy)) %>%
mutate(p_val = map_dbl(tidy_anova, ~.$p.value[2])) %>%
select(p_val)

building multiple models per group

I'd like to group my data, then build two linear models per group, gather the results, and use broom to summarize the model parameters, but I'm having an infinite recursion error that I cant seem to understand. Here's the code:
library(dplyr)
library(tidyr)
library(broom)
mtcars %>%
group_by(am) %>%
dplyr::do(simple_fit = lm(mpg ~ disp, data = .),
complex_fit = lm(mpg ~ disp + hp, data = .)) %>%
ungroup()
gather(model_type, model, -am) %>%
broom::tidy(model)
which results in this error:
Error: evaluation nested too deeply: infinite recursion / options(expressions=)?
There are only 4 models in this example, so I don't understand why I'm hitting such a deep nested loop?
I found a comment on the github that fixed my issue here
Fixed version of the code is as follows:
mtcars %>%
group_by(am) %>%
dplyr::do(simple_fit = lm(mpg~disp, data = .),
complex_fit = lm(mpg ~ disp + hp, data = .)) %>%
ungroup() %>%
gather(model_type, model, -am) %>%
rowwise() %>%
broom::tidy(model)

Resources