Grouped regression in R

I am trying to run a multiple linear regression, but I am getting the same coefficients for all of my grouped variables.
names <- rep(LETTERS[1:25], each = 20)
daysp <- runif(500, 1, 500)
startdate <- sample(seq(as.Date('1999/01/01'), as.Date('2020/01/01'), by = "day"), 500)
enddate <- sample(seq(as.Date('2010/01/01'), as.Date('2020/01/01'), by = "day"), 500)
class <- rep(LETTERS[1:4], each = 125)
amt <- runif(500, 10000, 500000)
secondclass <- rep(LETTERS[5:8], each = 125)  # renamed: R names cannot start with a digit
df <- data.frame(names, daysp, startdate, enddate, class, amt, secondclass)
I converted class and secondclass to factors, then ran:
fitted_models <- df %>%
  group_by(names) %>%
  do(model = lm(daysp ~ startdate + enddate + class + secondclass + amt, data = df))
fitted_models$model
How can I run the regressions and get different coefficients for each group?

`data = df` explicitly uses the entire data frame `df`, ignoring any grouping. Use `.` to refer to the data that is piped in, which will let `do()` use the groups. See the example at the bottom of `?do` for reference:
## From ?do
by_cyl <- mtcars %>% group_by(cyl)
models <- by_cyl %>% do(mod = lm(mpg ~ disp, data = .))
However, dplyr versions 1.0 and later prefer nest_by() (also demonstrated on the ?do help page):
models <- mtcars %>%
  nest_by(cyl) %>%
  mutate(mod = list(lm(mpg ~ disp, data = data)))
models %>% summarise(broom::tidy(mod))
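Applied to the question's data, the same idea might look like this (a minimal sketch, assuming 2ndclass has been renamed to secondclass as above; class and secondclass are close to constant within each names group, so they are left out of the per-group formula):
library(dplyr)

per_group <- df %>%
  nest_by(names) %>%
  mutate(mod = list(lm(daysp ~ startdate + enddate + amt, data = data)))

# one set of coefficients per group
per_group %>% summarise(broom::tidy(mod))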

Related

How can I replicate plot.lda() with R `tidymodels`?

I would like to replicate the plot.lda print method using ggplot2 and tidymodels. Is there an elegant way to get the plot?
I think I can fake the augment() function, which does not have an lda method, by using predict() and binding the result onto the original data.
Here is an example with the base R and tidymodels code:
library(ISLR2)
library(MASS)

# First base R
train <- Smarket$Year < 2005
lda.fit <-
  lda(
    Direction ~ Lag1 + Lag2,
    data = Smarket,
    subset = train
  )
plot(lda.fit)
# Next tidymodels
library(tidyverse)
library(tidymodels)
library(discrim)

lda_spec <- discrim_linear() %>%
  set_mode("classification") %>%
  set_engine("MASS")

the_rec <- recipe(
  Direction ~ Lag1 + Lag2,
  data = Smarket
)

the_workflow <- workflow() %>%
  add_recipe(the_rec) %>%
  add_model(lda_spec)

Smarket_train <- Smarket %>%
  filter(Year != 2005)

the_workflow_fit_lda_fit <-
  fit(the_workflow, data = Smarket_train) %>%
  extract_fit_parsnip()

# now my attempt to do the plot
predictions <- predict(the_workflow_fit_lda_fit,
  new_data = Smarket_train,
  type = "raw"
)[[3]] %>%
  as.vector()

bind_cols(Smarket_train, .fitted = predictions) %>%
  ggplot(aes(x = .fitted)) +
  geom_histogram(aes(y = stat(density)), binwidth = .5) +
  scale_x_continuous(breaks = seq(-4, 4, by = 2)) +
  facet_grid(vars(Direction)) +
  xlab("") +
  ylab("Density")
There must be a better way to do this.... thoughts?
You can do this by using a combination of extract_fit_*() and parsnip::repair_call(). The plot.lda() method uses the $call object in the LDA fit, which we need to adjust, since the call object produced via tidymodels differs from the one produced by calling lda() directly.
library(ISLR2)
library(MASS)

# First base R
train <- Smarket$Year < 2005
lda.fit <-
  lda(
    Direction ~ Lag1 + Lag2,
    data = Smarket,
    subset = train
  )

# Next tidymodels
library(tidyverse)
library(tidymodels)
library(discrim)

lda_spec <- discrim_linear() %>%
  set_mode("classification") %>%
  set_engine("MASS")

the_rec <- recipe(
  Direction ~ Lag1 + Lag2,
  data = Smarket
)

the_workflow <- workflow() %>%
  add_recipe(the_rec) %>%
  add_model(lda_spec)

Smarket_train <- Smarket %>%
  filter(Year != 2005)

the_workflow_fit_lda_fit <-
  fit(the_workflow, data = Smarket_train)
After fitting both models, we can inspect the $call objects and we see that they are different.
lda.fit$call
#> lda(formula = Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
extract_fit_engine(the_workflow_fit_lda_fit)$call
#> lda(formula = ..y ~ ., data = data)
The parsnip::repair_call() function will replace data with the data we pass in. Additionally, we will rename the response of the data to ..y to match the call.
the_workflow_fit_lda_fit %>%
  extract_fit_parsnip() %>%
  parsnip::repair_call(rename(Smarket_train, ..y = Direction)) %>%
  extract_fit_engine() %>%
  plot()
Created on 2021-11-12 by the reprex package (v2.0.1)

Fable TSLM different models yield identical predictions

I'm trying to fit and forecast TSLM models with different time-t predictors added alongside the trend. For reasons I don't understand, several of the models yield identical predictions, even though the input test data appears different and the coefficients in the models themselves appear different. Almost certainly this is an error on my part... let me know what's going wrong!
Reprex data
suppressPackageStartupMessages({
  library(tidyverse)
  library(tsibble)
  library(fable)
  library(feasts)
})

proj_tract <- read_csv("path_to_reprexdata")
proj_tract <- as_tsibble(proj_tract, key = tractid, index = year)

train <- proj_tract %>%
  filter(year < 2019)
test <- proj_tract %>%
  filter(year >= 2019)

fit <- train %>%
  model(
    trend_only = TSLM(log(chh) ~ trend()),
    trend_w_dar = TSLM(log(chh) ~ trend() + log(ig_count_imptd)),
    trend_w_da1 = TSLM(log(chh) ~ trend() + log(prd_1)),
    trend_w_da2 = TSLM(log(chh) ~ trend() + log(prd_2)),
    trend_w_da3 = TSLM(log(chh) ~ trend() + log(prd_3)),
    trend_w_da4 = TSLM(log(chh) ~ trend() + log(prd_4)),
    trend_w_da5 = TSLM(log(chh) ~ trend() + log(prd_glmnet))
  )

fc <- forecast(
  fit,
  new_data = test
) %>%
  hilo(.95)

res <- fc %>%
  as_tibble() %>%
  rename("proj" = ".mean", "model" = ".model") %>%
  select(model, proj, lchh) %>%
  pivot_wider(names_from = model, values_from = proj)

head(res)
A subset of these models yields identical predictions -- help me understand why!

Extract slope and r squared from grouped linear models using broom

I have a dataframe that I want to run linear models on by group, then use the broom package to extract the slope and r squared for each model. So far I am trying this:
library(tidyverse)
library(broom)

# read in the dataset
data(mtcars)

# add a group variable
mtcars <- mtcars %>% as_tibble() %>% mutate(LC = 1)

# create a second group
mtcars2 <- mtcars
mtcars2 <- mtcars2 %>% mutate(LC = 2)

# bind together
mtcars <- rbind(mtcars, mtcars2)

# group by and run regressions
all_regress <- mtcars %>% group_by(LC) %>%
  do(mod1 = lm(mpg ~ disp, data = .),
     mod2 = lm(mpg ~ wt, data = .))

# use broom to extract the slope and rsq per group
glance <- all_regress %>% mutate(tidy = map(mod1, broom::tidy),
                                 glance = map(mod1, broom::glance),
                                 augment = map(mod1, broom::augment),
                                 rsq = glance %>% map_dbl('r.squared'),
                                 slope = tidy %>% map_dbl(function(x) x$estimate[2]))
but this fails with:
Error: Problem with `mutate()` input `tidy`.
x No tidy method for objects of class qr
ℹ Input `tidy` is `map(mod1, broom::tidy)`.
ℹ The error occurred in row 1.
If I do this without groups such as:
# read in the dataset
data(mtcars)
mtcars <- mtcars %>% as_tibble()

# run regressions
all_regress <- mtcars %>%
  do(mod1 = lm(mpg ~ disp, data = .),
     mod2 = lm(mpg ~ wt, data = .))

# use broom to extract the slope and rsq
glance <- all_regress %>% mutate(tidy = map(mod1, broom::tidy),
                                 glance = map(mod1, broom::glance),
                                 augment = map(mod1, broom::augment),
                                 rsq = glance %>% map_dbl('r.squared'),
                                 slope = tidy %>% map_dbl(function(x) x$estimate[2]))
there is no error.
I think simply adding ungroup() achieves what you need:
all_regress <- mtcars %>% group_by(LC) %>%
  do(mod1 = lm(mpg ~ disp, data = .),
     mod2 = lm(mpg ~ wt, data = .)) %>%
  ungroup()

# use broom to extract the slope and rsq per group
glance <- all_regress %>% mutate(tidy = map(mod1, broom::tidy),
                                 glance = map(mod1, broom::glance),
                                 augment = map(mod1, broom::augment),
                                 rsq = glance %>% map_dbl('r.squared'),
                                 slope = tidy %>% map_dbl(function(x) x$estimate[2]))
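Since the grouping variable is retained, the per-group slope and r-squared can then be pulled out directly (a quick check using the objects defined above):
# per-group slope and r-squared alongside the grouping variable
glance %>% select(LC, rsq, slope)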
I used this approach; it's longer, but I think there's more control in the individual steps. Finally, I created a tibble with list columns containing each model.
library(tidyverse)
library(broom)

# read in the dataset
data(mtcars)

# add a group variable
mtcars <- mtcars %>% as_tibble() %>% dplyr::select(-c(vs, am, gear, carb, cyl)) %>% mutate(LC = 1)

# create a second group
mtcars2 <- mtcars
mtcars2 <- mtcars2 %>% mutate(LC = 2)

# bind together
mtcars <- bind_rows(mtcars2, mtcars)

# group_split and run regressions
all_regress <- mtcars %>% group_split(LC) %>%
  map(~ list(mod1 = lm(mpg ~ disp, data = .),
             mod2 = lm(mpg ~ wt, data = .)))
# example <- all_regress[[2]][[1]] %>% glance()

# the list has 2 levels with 2 models each
data <- all_regress %>%
  map(~
    map(.x, function(model) {
      # list columns are needed because each function outputs a different kind of object
      tibble(mod = list(model),
             tidy = list(broom::tidy(model)),
             glance = list(broom::glance(model)),
             augment = list(broom::augment(model))) %>%
        mutate(
          rsq = list(glance[[1]]$r.squared),
          slope = list(tidy[[1]]$estimate[2]))
    }))

data_final <-
  data %>% map2(unique(mtcars$LC), ~
    map2(.x, .y, function(each_model, lc) {
      mutate(each_model, LC = lc)
    }))

# because of the list structure I need to bind the two datasets in each level
# and then bind them again
final_format <- map(data_final, ~ reduce(.x, rbind)) %>% reduce(rbind)

# access the data
final_format[1, 1][[1]]
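To flatten this into one summary row per fitted model (a sketch using the objects built above; rsq and slope are length-one list columns, so they are unnested into regular columns):
final_format %>%
  select(LC, rsq, slope) %>%
  unnest(c(rsq, slope))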

How to use augment with a model on new data

It is fairly straightforward to use the augment() function from the broom package in R to add predictions back into a tibble:
library(tidyverse)
library(broom)

df <- iris %>%
  nest(data = everything()) %>%
  mutate(model = map(data, function(x) lm(Sepal.Length ~ Sepal.Width, data = x)),
         pred = map2(model, data, ~augment(.x, newdata = .y))) %>%
  unnest(pred)
However, when I take a linear model trained on one set of data and try and predict on new data I receive the following error.
mod <- lm(Sepal.Length ~ Sepal.Width, data = iris)

df2 <- iris %>%
  mutate(Sepal.Width = Sepal.Width + rnorm(1)) %>%
  nest(data = everything()) %>%
  mutate(pred = map2(mod, data, ~augment(.x, newdata = .y)))
# Error: Problem with `mutate()` input `pred`.
# x No augment method for objects of class numeric
# ℹ Input `pred` is `map2(mod, data, ~augment(.x, newdata = .y))`.
How should I use augment to predict on new data? Is using an external model object (in the example above this is mod) best practice, or is there a more elegant way?
Since there is only one model we can do this without using map.
library(dplyr)

df1 <- iris %>%
  mutate(Sepal.Width = Sepal.Width + rnorm(1)) %>%
  tidyr::nest(data = everything()) %>%
  summarise(pred = broom::augment(mod, newdata = data[[1]]),
            mod = list(mod),
            data = data)
Having just posted the question, I think I have an answer. I won't accept the answer for 48 hours just in case someone contradicts or provides a more comprehensive one.
In the example, map2 expects mod as a vector or list, but it is a model object. Putting mod into the tibble as a list column suppresses the error and correctly calculates the predictions.
mod <- lm(Sepal.Length ~ Sepal.Width, data = iris)

df2 <- iris %>%
  mutate(Sepal.Width = Sepal.Width + rnorm(1)) %>%
  nest(data = everything()) %>%
  mutate(mod = list(mod)) %>% #! this is the additional step
  mutate(pred = map2(mod, data, ~augment(.x, newdata = .y))) %>%
  unnest(pred)
Alternatively, coerce the external model object to a list inline:
...
mutate(pred = map2(list(mod), data, ~augment(.x, newdata = .y))) %>%
...
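For completeness, when there is just one external model and one new data frame, the nesting can be skipped entirely and augment() called on the new data directly (a minimal sketch; new_iris is only an illustrative name):
library(broom)
library(dplyr)

new_iris <- iris %>% mutate(Sepal.Width = Sepal.Width + rnorm(1))
augment(mod, newdata = new_iris)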

dplyr, do(), extracting parameters from model without losing grouping variable

A slightly changed example from the R help for do():
by_cyl <- group_by(mtcars, cyl)
models <- by_cyl %>% do(mod = lm(mpg ~ disp, data = .))
coefficients <- models %>% do(data.frame(coef = coef(.$mod)[[1]]))
The data frame coefficients contains the first coefficient of the linear model for each cyl group. My question is: how can I produce a data frame that contains not only a column with the coefficients, but also a column with the grouping variable?
===== Edit: I have extended the example to make my problem clearer.
Let's suppose that I want to extract the coefficients of the model and some predictions. I can do this:
by_cyl <- group_by(mtcars, cyl)

getpars <- function(df) {
  fit <- lm(mpg ~ disp, data = df)
  data.frame(intercept = coef(fit)[1], slope = coef(fit)[2])
}

getprediction <- function(df) {
  fit <- lm(mpg ~ disp, data = df)
  x <- df$disp
  y <- predict(fit, data.frame(disp = x), type = "response")
  data.frame(x, y)
}

pars <- by_cyl %>% do(getpars(.))
prediction <- by_cyl %>% do(getprediction(.))
The problem is that the code is redundant, because I am fitting the model twice. My idea was to build a function that returns a list with all the information:
getAll <- function(df) {
  results <- list()
  fit <- lm(mpg ~ disp, data = df)
  x <- df$disp
  y <- predict(fit, data.frame(disp = x), type = "response")
  results$pars <- data.frame(intercept = coef(fit)[1], slope = coef(fit)[2])
  results$prediction <- data.frame(x, y)
  results
}
The problem is that I don't know how to use do() with the function getAll to obtain, for example, just a data frame with the parameters (like the data frame pars).
Like this?
coefficients <- models %>% do(data.frame(coef = coef(.$mod)[[1]], group = .[[1]]))
yielding
      coef group
1 40.87196     4
2 19.08199     6
3 22.03280     8
Using the approach of Hadley Wickham in this video:
library(dplyr)
library(purrr)
library(tidyr)   # for nest()/unnest()
library(broom)

fitmodel <- function(d) lm(mpg ~ disp, data = d)

by_cyl <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  mutate(mod = map(data, fitmodel),
         pars = map(mod, tidy),
         pred = map(mod, augment))

pars <- by_cyl %>% unnest(pars)
prediction <- by_cyl %>% unnest(pred)
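From pars, the slope for each cyl group can then be pulled out with ordinary dplyr verbs, keeping the grouping variable (a short sketch using the objects built above):
# slope of disp for each cyl group
pars %>%
  filter(term == "disp") %>%
  select(cyl, slope = estimate)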
