Writing loop on linear regression - r

I am working on a problem that I found on Reddit and am experimenting with how to solve it using the mtcars data set.
This was the problem:
He has a list that looks like this: https://gyazo.com/0637f2226d8f53db4c90716bd3fb698c with 150 different "selskapsid".
He wants to run a linear regression with "Return12" as the dependent variable and "SROE", "MktCap", and "y" as independent variables for each "Selskapsid". (Basically a separate regression for each id; even if an id is repeated, I want a separate model.)
I read the comments there but did not find a good solution, so I tried dplyr and other packages I am somewhat comfortable with. The issue is that the cyl values are factors, so when I build the model the cyl value is not repeated.
Does anyone know a simple loop to achieve this? I want to do training and testing in the same loop; I also wasn't getting the training results properly.
Using these libraries, I was doing this:
library(tidyverse)
library(broom)
mtcars %>%
  nest(data = -cyl) %>%   # one nested data frame per cyl value
  mutate(fit = map(data, ~ lm(mpg ~ hp + wt + disp, data = .)),   # `=` names the column; `<-` does not
         results = map(fit, augment)) %>%
  unnest(results)
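Since the question asks for a simple loop that trains and tests in one pass, here is a minimal base R sketch of that idea. The 70/30 split, the seed, and the reduced formula (mpg ~ hp + wt, chosen because the cyl groups in mtcars are very small) are assumptions for illustration, not part of the original post.

```r
# Sketch: one model per cyl group, with a train/test split inside the loop.
set.seed(42)  # arbitrary seed, only for reproducibility

groups <- split(mtcars, mtcars$cyl)   # one data frame per cyl value
results <- list()

for (g in names(groups)) {
  d <- groups[[g]]
  # Assumed 70/30 split within each group
  train_idx <- sample(nrow(d), size = floor(0.7 * nrow(d)))
  train <- d[train_idx, ]
  test  <- d[-train_idx, ]

  fit <- lm(mpg ~ hp + wt, data = train)

  # Collect training and test error for this group
  results[[g]] <- data.frame(
    cyl        = g,
    n_train    = nrow(train),
    n_test     = nrow(test),
    train_rmse = sqrt(mean(residuals(fit)^2)),
    test_rmse  = sqrt(mean((test$mpg - predict(fit, newdata = test))^2))
  )
}

results <- do.call(rbind, results)
print(results)
```

With real data that has more rows per id, the same loop should work with the full set of predictors; with groups as small as these, the error estimates are very noisy.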

Related

How to run ggpredict() in a loop following multiple regression models?

The aim is to get the predicted values from several regression models. First I run several regression models using the following code:
library(dplyr)
library(tidyr)
library(broom)
library(ggeffects)
mtcars$cyl=as.factor(mtcars$cyl)
df <- mtcars %>%
group_by(cyl) %>%
do(model1 = tidy(lm(mpg ~ wt + gear + am , data = .), conf.int=TRUE)) %>%
gather(model_name, model, -cyl) %>% ## make it long format
unnest()
I would like to get the predicted values for my predictor weight (wt). If I want to run the code manually for each number of cylinders (cyl), it looks like this:
#Filter by number of cylinders
df=filter(mtcars, cyl==4)
#Save the regression
mod= lm(mpg ~ wt + gear + am, data = df)
#Run the predictive probabilities
pred <- ggpredict(mod, terms = c("wt"))
This is the code for only the first cylinder group (cyl==4); we would then have to run the same code for the second (cyl==6) and the third (cyl==8), which is cumbersome. My aim is to automate this, as I do for the regression analyses in the first code block above. I would also like to get these results in the same format as the first code block, that is, in a format that can be plotted afterwards. Can someone help me with that?
Rerun the models with ggpredict() on the inside:
df <- mtcars %>%
group_by(cyl) %>%
do(model1 = ggpredict(lm(mpg ~ wt + gear + am, data= .), terms = c("wt"))) %>%
gather(model_name, model, -cyl) %>% unnest_legacy()
You can then plot wt (in the 'x' column) against 'predicted'. Note that you'll get a warning message on these data.

Obtain P-Value of Fixed Value in Anova Table of many Linear Regressions with Broom Package

In the multiple linear regression lm(FE_FCE2 ~ Trial + .x, data = DF_FCE3) there is one fixed variable (Trial) and many x variables. I am analysing each x variable against FE_FCE2 with Trial as a fixed effect. I then use the broom package to run the many regressions and collect the results in one table. I have obtained the regression results, but I cannot add the data from the ANOVA table using broom together with the map function.
Is it possible? And if yes, how?
I used the following code to obtain the regression results:
DF_FCE3 %>%
select(-FE_FCE2, -Trial) %>% # exclude outcome, leave only predictors
map( ~lm(FE_FCE2 ~ Trial + .x, data = DF_FCE3)) %>%
map(summary) %>%
map_df(glance) %>%
round(3) -> rsme
However, I would like to obtain the p-value (4.26e-08 ***) for Trial from the ANOVA table, to see if Trial had a significant influence on the x variable.
$x1
Analysis of Variance Table
Response: FE_FCE2
          Df  Sum Sq  Mean Sq F value   Pr(>F)
Trial      3 0.84601 0.282002 15.0653 4.26e-08 ***
.x         1 0.00716 0.007161  0.3826   0.5377
Residuals 95 1.77827 0.018719
---
Is it possible to use the broom package with map function to obtain a table which contains all the many p values of the anova regressions?
Like this (using mpg)?
This returns a data frame with one column for each original column except the outcome and the fixed variable (hwy and cyl in this example, FE_FCE2 and Trial in your case), and one row containing the p-value of the fixed variable.
mpg %>%
select(-hwy, -cyl) %>% # exclude outcome, leave only predictors
map( ~lm(hwy ~ cyl + .x, data = mpg)) %>%
map(anova) %>%
map(broom::tidy) %>%
map_df(~.$p.value[1])
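For comparison, the same extraction can be sketched in base R without broom or purrr. This version uses mtcars with mpg as the outcome and cyl as the fixed term, both chosen here only for illustration. Note that anova() on a single lm fit reports sequential sums of squares, so the fixed term has to come first in the formula for its row to be the first row of the table.

```r
# Sketch: p-value of the fixed term (cyl) from the ANOVA table of mpg ~ cyl + x,
# repeated for every other column x in mtcars.
predictors <- setdiff(names(mtcars), c("mpg", "cyl"))

p_fixed <- sapply(predictors, function(x) {
  fit <- lm(reformulate(c("cyl", x), response = "mpg"), data = mtcars)
  anova(fit)[["Pr(>F)"]][1]   # first row of the ANOVA table is the fixed term
})

# One row per predictor, with the p-value of cyl in that model
p_table <- data.frame(predictor = predictors, p_cyl = unname(p_fixed))
print(p_table)
```

A long table like this (one row per predictor) is often easier to sort or filter than the one-row wide layout.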

Omitting covariates in a tbl_regression from gtsummary package

Using gtsummary, I want to display my adjusted linear regression model without displaying the covariates. So far I have not found a solution for this. Does anyone know how best to do this?
For example, using the code below, I would like to display the first row, which shows the cylinder variable, and omit the subsequent rows (disp and hp).
# download pacman package if not installed, otherwise load it
if(!require(pacman)) install.packages("pacman")
# loads relevant packages using the pacman package
pacman::p_load(
magrittr, # for pipes
gtsummary) # for tables
mtcars %>%
lm(mpg ~ cyl + disp + hp, data = .) %>%
tbl_regression()
The table currently looks like this...
As per Daniel's suggestion:
library(gtsummary)
mtcars %>%
lm(mpg ~ cyl + disp + hp, data = .) %>%
tbl_regression(include = c("cyl","disp"))
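The include argument in the snippet above lists both cyl and disp, so disp would still appear in the table. To show only the cyl row, as the question asks, a minimal sketch (assuming gtsummary is installed) would pass just that one variable. The model is still adjusted for disp and hp; only the display changes.

```r
library(gtsummary)

# Fit the adjusted model, but display only the cyl row.
fit <- lm(mpg ~ cyl + disp + hp, data = mtcars)
tbl <- tbl_regression(fit, include = "cyl")
tbl
```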

PLM package: Balanced data shown as unbalanced in regression

The dataset I am using here is unbalanced, so I balanced it manually by removing the IDs that have more or fewer than two observations (this is a characteristic of my data, as a single household may later split into different ones). T is 2 here.
dataset %>% group_by(ID) %>% summarise(N =n()) %>% filter(N> 2 | N < 2)
Then I removed these rogue observations, so now the panel is balanced. I converted the data to a pdata frame afterwards:
dataset <-plm.data(dataset, 30462)
And when I run is.pbalanced, it shows TRUE. But the problem is when I run the regression
plm(DEP~ VAR1 + VAR2, data= dataset, model= "within")
The summary shows this
Unbalanced Panel: n=20236, T=1-2, N=34920
I don't understand what I am missing here. Any suggestions will be greatly appreciated.
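One possible explanation, which the post does not confirm, is missing values: regression functions generally drop rows with an NA in any model variable, so the estimation sample can be unbalanced even when the raw panel is balanced. A quick base R check on a toy panel (all data and the YEAR column here are hypothetical) might look like this:

```r
# Toy panel (hypothetical data): 3 IDs, 2 periods each, one missing regressor value.
panel <- data.frame(
  ID   = rep(1:3, each = 2),
  YEAR = rep(1:2, times = 3),
  DEP  = c(1.0, 1.2, 0.8, 0.9, 1.5, 1.7),
  VAR1 = c(0.3, 0.4, NA, 0.6, 0.2, 0.1),
  VAR2 = c(5, 6, 7, 8, 9, 10)
)

# Every ID has two rows, so the raw panel is balanced.
print(table(panel$ID))

# Rows that a model DEP ~ VAR1 + VAR2 can actually use.
used <- panel[complete.cases(panel[, c("DEP", "VAR1", "VAR2")]), ]
obs_per_id <- table(used$ID)

# Any ID with fewer than 2 usable rows makes the estimation sample unbalanced.
print(obs_per_id)
```

Running the same complete.cases() count on the real dataset would show whether some IDs are left with a single usable observation.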

Fitting several regression models after group_by with dplyr and applying the resulting models into test sets

I have a big dataset that I want to partition based on the values of a particular variable (in my case lifetime), and then run logistic regression on each partition. Following the answer of @tchakravarty in "Fitting several regression models with dplyr", I wrote the following code:
lifetimemodels = data %>% group_by(lifetime) %>% sample_frac(0.7)%>%
do(lifeModel = glm(churn ~., x= TRUE, family=binomial(link='logit'), data = .))
My question now is how I can use the resulting logistic models to compute the AUC on the rest of the data (the 0.3 fraction that was not chosen), which should again be grouped by lifetime.
Thanks a lot in advance!
You could adapt your dplyr approach to use the tidyr and purrr framework. Look at grouping/nesting, and at the mutate and map functions, to create list columns that store the pieces of your workflow.
The test/training split you are looking for is part of modelr, a package built to assist modelling within the purrr framework. Specifically, see the crossv_mc and crossv_kfold functions.
A toy example using mtcars (just to illustrate the framework).
library(dplyr)
library(tidyr)
library(purrr)
library(modelr)
analysis <- mtcars %>%
  nest(data = -cyl) %>%
  mutate(cv = map(data, ~crossv_mc(.x, 1, test = 0.3))) %>%   # one train/test split per group
  unnest(cv) %>%
  mutate(model = map(train, ~lm(mpg ~ wt, data = .x))) %>%
  mutate(pred = map2(model, train, predict)) %>%
  mutate(error = map2_dbl(model, test, rmse))
This:
Takes mtcars.
Nests it by cyl into a list column called data.
Splits each data element into training and test sets by mapping crossv_mc over it, then uses unnest to create the train and test list columns.
Maps the lm model to each train set and stores the result in model.
Maps the predict function over model and train and stores the result in pred.
Maps the rmse function over model and test and stores the result in error.
There are probably users out there more familiar than me with the workflow, so please correct/elaborate.
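The toy example above reports RMSE for a linear model, while the question asks for AUC from logistic models on the held-out 30%. A base R sketch of that variant follows. The simulated data, the usage predictor, and the rank-based auc helper are all assumptions for illustration; a dedicated package could be used for the AUC instead.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical data: 'lifetime' is the grouping variable, 'churn' a 0/1 outcome.
n <- 600
dat <- data.frame(
  lifetime = sample(c("short", "medium", "long"), n, replace = TRUE),
  usage    = rnorm(n)
)
dat$churn <- rbinom(n, 1, plogis(-0.5 + 1.2 * dat$usage))

# Rank-based AUC (equivalent to the normalised Mann-Whitney U statistic).
auc <- function(y, score) {
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(rank(score)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# For each lifetime group: fit on 70%, score the remaining 30%, compute AUC.
auc_by_group <- sapply(split(dat, dat$lifetime), function(d) {
  train_idx <- sample(nrow(d), size = floor(0.7 * nrow(d)))
  fit <- glm(churn ~ usage, family = binomial(link = "logit"), data = d[train_idx, ])
  test <- d[-train_idx, ]
  auc(test$churn, predict(fit, newdata = test, type = "response"))
})

print(auc_by_group)
```

The key point is that the split happens inside each group, so the rows used for the AUC are exactly the ones the group's model never saw.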
