Select columns from a dataframe for linear regression in R

I have a dataframe Data_Group_7_8 and would like to make a linear regression based on a factor analysis.
The factor analysis paired variables from col 1:4 as MR1 and col 16:20 as MR2. I want to set col 1:4 as independent variable and 16:20 as dependent and tried the following code:
mdl <- lm(select(1:4) ~ select(16:20), data=Data_Group_7_8)
summary(mdl)
Which unfortunately doesn't work. But the following does:
df2 <- data.frame(x=Data_Group_7_8 %>% select(1:4),y=Data_Group_7_8 %>% select(16:20))
lrm <- lm(x.Themenwelt_1+ x.Themenwelt_2+ x.Themenwelt_3+ x.Product_demonstration ~ y.Inspired_by_1+ y.Inspired_by_2+ y.Inspired_by_3+ y.Inspired_by_4+ y.Inspired_by_5, data=df2)
summary(lrm)
Is there a way to select the variables (Themenwelt_1 etc.) directly from the original Data_Group_7_8 (as I tried in the first snippet) instead of spelling them all out from a new data frame? I have to run 60 different analyses with this data frame.

R allows you to build a formula from a string using as.formula(str). Each side has to be the sum of the terms considered, and the LHS and RHS need to be joined with a tilde. You can get the names of the columns using names(); then it is just a matter of pasting them together: first build each side of the equation by collapsing the character vector to a single string with collapse = ' + ', then combine the two sides separated by a tilde. This is an example with the built-in mtcars dataset:
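library(dplyr)
# Build a formula "a + b ~ c + d" from two column ranges of dat
regFormula <- function(dat, range1, range2) {
  dat %>%
    select(all_of(range1)) %>%   # columns for the left-hand side
    names() %>%
    paste(collapse = ' + ') %>%
    paste(dat %>%
            select(all_of(range2)) %>%   # columns for the right-hand side
            names() %>%
            paste(collapse = ' + '),
          sep = ' ~ ') %>%
    as.formula()
}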
regFormula(mtcars,1:3,4:5)
# mpg + cyl + disp ~ hp + drat
# <environment: 0x000000000cf55c90>
You can use this directly as the formula in your linear model.
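As a minimal sketch of that last step, the snippet below builds the same formula with base R only and passes it to lm(). The helper name build_formula is hypothetical (it is not from the answer above). One thing worth keeping in mind: when several columns are joined with + on the left-hand side, lm() evaluates that expression arithmetically, so the response is the row-wise sum of those columns.

```r
# Hypothetical helper (not from the answer above): build "a + b ~ c + d"
# from column positions using base R only.
build_formula <- function(dat, lhs_cols, rhs_cols) {
  lhs <- paste(names(dat)[lhs_cols], collapse = " + ")
  rhs <- paste(names(dat)[rhs_cols], collapse = " + ")
  as.formula(paste(lhs, rhs, sep = " ~ "))
}

f <- build_formula(mtcars, 1:3, 4:5)

# The formula object can be passed straight to lm()
mdl <- lm(f, data = mtcars)
summary(mdl)

# The response used by lm() is the arithmetic sum of the LHS columns
head(model.response(model.frame(mdl)))
```

For many analyses, the same helper could be called in a loop or with lapply() over a list of column ranges.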

Related

Multiple fixest_multi models and shape parameter - modelsummary package

Again, thanks to Laurent for answering questions and supporting the modelsummary package.
library(tidyverse)
library(fixest)
library(modelsummary)
fit <- mtcars %>% feols(c(mpg, hp) ~ 1)
fit_1 <- mtcars %>% feols(c(mpg, hp, wt) ~ 1)
fit_2 <- mtcars %>% feols(c(mpg, hp, gear, wt) ~ 1)
modelsummary(c(fit, fit_1, fit_2), shape = model + statistic ~ term, output = "flextable")
We obtain the long column of estimates from all 3 models (which are simple averages) as in:
So is there a way to rearrange the columns and the rows either using internal modelsummary functions or external work to the following format:
The biggest problem is moving the terms around so that they are aligned on the same line (note that the order of terms between fit_1 and fit_2 is changed) and the rest is filled with NA. Would really appreciate any help! It's a part of a larger problem I've been trying to solve unsuccessfully for the last 3 weeks.
One option is to output to a data frame and reshape manually:
library(tidyverse)
library(fixest)
library(modelsummary)
library(flextable)
fit <- mtcars %>% feols(c(mpg, hp) ~ 1)
fit_1 <- mtcars %>% feols(c(mpg, hp, wt) ~ 1)
fit_2 <- mtcars %>% feols(c(mpg, hp, gear, wt) ~ 1)
models <- c(fit, fit_1, fit_2)
modelsummary(
  models,
  output = "dataframe",
  shape = model + statistic ~ term) |>
  mutate(
    fit = c(rep(1, 4), rep(2, 6), rep(3, 8)),
    model = trimws(model)) |>
  pivot_wider(names_from = "fit", values_from = "(Intercept)") |>
  select(-statistic, -part, ` ` = model) |>
  flextable()
This is a very customized shape and modeling context, so I can't currently think of a way to achieve that using only modelsummary's own function arguments.

How to run ggpredict() in a loop following multiple regression models?

The aim is to get the predicted probabilities from several regression models. First I run several regression models using the following code:
library(dplyr)
library(tidyr)
library(broom)
library(ggeffects)
mtcars$cyl <- as.factor(mtcars$cyl)
df <- mtcars %>%
  group_by(cyl) %>%
  do(model1 = tidy(lm(mpg ~ wt + gear + am, data = .), conf.int = TRUE)) %>%
  gather(model_name, model, -cyl) %>%  ## make it long format
  unnest(cols = model)  # name the list column explicitly (required by tidyr >= 1.0)
I would like to get the predicted probabilities for my predictor weight (wt). If I run the code manually for each cylinder group (cyl), it looks like this:
#Filter by number of cylinders
df=filter(mtcars, cyl==4)
#Save the regression
mod= lm(mpg ~ wt + gear + am, data = df)
#Run the predictive probabilities
pred <- ggpredict(mod, terms = c("wt"))
This is the code for the first cylinder group only (cyl==4); we would then have to run the same code for the second (cyl==6) and the third (cyl==8), which is cumbersome. My aim is to automate this, as I do for the regression analyses in the first code block above. I would also like to get the results in the same format as the first code block, in other words in a format that can be plotted afterwards. Can someone help me with that?
Rerun the models with ggpredict() on the inside:
df <- mtcars %>%
  group_by(cyl) %>%
  do(model1 = ggpredict(lm(mpg ~ wt + gear + am, data = .), terms = c("wt"))) %>%
  gather(model_name, model, -cyl) %>%
  unnest_legacy()
You can then plot wt (in the 'x' column) against 'predicted'. Note that you'll get a warning message on these data.
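For comparison, here is a rough base-R sketch of the same idea that does not rely on ggeffects. It is not what ggpredict() does internally: holding gear and am at their group means, and the 5-point grid for wt, are assumptions made here for illustration. The warning is suppressed because within some cylinder groups the predictors can be collinear, which makes predict() complain about a rank-deficient fit.

```r
# Base-R sketch (an assumption, not the ggeffects implementation):
# fit one model per cylinder group and predict mpg over a grid of wt,
# holding gear and am at their group means.
fits <- lapply(split(mtcars, mtcars$cyl), function(d) {
  mod <- lm(mpg ~ wt + gear + am, data = d)
  grid <- data.frame(
    wt   = seq(min(d$wt), max(d$wt), length.out = 5),
    gear = mean(d$gear),
    am   = mean(d$am)
  )
  # A rank-deficient fit (collinear predictors in a group) may warn here
  grid$predicted <- suppressWarnings(predict(mod, newdata = grid))
  grid$cyl <- d$cyl[1]
  grid
})

# One long data frame, one block of rows per cylinder group
pred_all <- do.call(rbind, fits)
head(pred_all)
```

The resulting pred_all has one row per grid point and group, so wt can be plotted against predicted with one line per cyl.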

How to map over dataframe, is it a tidyeval error?

Want to map over columns in a dataframe & perform t-tests with each column against a fixed column. Desired output would be a dataframe with each row(s) being t-test results - can use map_dfr once mapping process ok
Dug into tidy eval, not sure if it's a tidy eval error - any help much appreciated!
(mtcars as toy dataset)
library(tidyverse)  # map() and %>%
library(ggpubr)     # compare_means() is provided by ggpubr
# Test single cases - good
compare_means(mpg ~ cyl, data = mtcars)
compare_means(disp ~ cyl, data = mtcars)
compare_means(hp ~ cyl, data = mtcars)
# Trial map - fail
mtcars %>%
map(~compare_means(.x ~ cyl, data = mtcars))
Error: Can't subset columns that don't exist.
x Column `.x` doesn't exist.
Following tidyeval guidance: https://tidyeval.tidyverse.org/dplyr.html
Tried to see if quoting / unquoting was the issue, but no dice
# Abstract variables
test_data <- function(group_var) {
quote_var <- enquo(group_var)
data %>% compare_means(quote_var ~ cyl, data = mtcars)
}
That's an NSE error, but not tidyeval. You're mapping over the vectors inside mtcars. You're not mapping over the column names of mtcars.
With inject() from the latest rlang version you can do some NSE programming with non-tidyeval functions:
names(mtcars) %>% map(~ rlang::inject(compare_means(!!rlang::sym(.x) ~ cyl, data = mtcars)))
Three things are going on:
We map over the names of the data frame.
We transform the name to a symbol, i.e. an R variable.
We inject that symbol into the formula using inject() and !!.
I have not tested the code.
Actually, it may just be about formula evaluation specifically:
library(ggpubr)
library(tidyverse)
# Test data with 2 Species only
iris.subset <- iris %>%
filter(Species != 'virginica')
# Test single case
iris.subset %>%
compare_means(Sepal.Width ~ Species, data = .)
# Test direct map - doesn't work
iris.subset[1:4] %>%
map(~compare_means(. ~ Species, data = iris.subset))
Is it about formula evaluation? Test as.formula()
as.formula(paste0(names(iris.subset[1]), " ~ Species"))
# Pipe into test
names(iris.subset[1:4]) %>%
map_df(~compare_means(formula = as.formula(paste0(., " ~ Species")), data = iris.subset))
Success!!
Couldn't get an example to work with mtcars but will re-post if I do
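A minimal base-R sketch of the same pattern on mtcars follows. Two things are swapped in here and are assumptions, not part of the posts above: base t.test() replaces compare_means(), and the two-level column am replaces cyl as the grouping variable, because a t-test needs exactly two groups. The formula is built from the column name with reformulate().

```r
# Base-R sketch: one Welch t-test per outcome column against a fixed
# two-level grouping column (am), with formulas built from column names.
outcomes <- c("mpg", "disp", "hp")

results <- do.call(rbind, lapply(outcomes, function(v) {
  f  <- reformulate("am", response = v)   # e.g. mpg ~ am
  tt <- t.test(f, data = mtcars)
  data.frame(
    outcome   = v,
    statistic = unname(tt$statistic),
    p.value   = tt$p.value,
    stringsAsFactors = FALSE
  )
}))

results
```

Each row of results holds the test for one outcome column, which is the row-per-test layout asked for in the question.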

How to change the order of coefficients in a coefficient plot in R (package dotwhisker)

Here is some sample code from the official package documentation.
#Package preload
library(dotwhisker)
library(broom)
library(dplyr)
# run a regression compatible with tidy
m1 <- lm(mpg ~ wt + cyl + disp + gear, data = mtcars)
m2 <- update(m1, . ~ . + hp) # add another predictor
m1_df <- tidy(m1) %>% filter(term != "(Intercept)") %>% mutate(model = "Model 1")
m2_df <- tidy(m2) %>% filter(term != "(Intercept)") %>% mutate(model = "Model 2")
two_models <- rbind(m1_df, m2_df)
dwplot(two_models)
which produces this:
The most logical order inside the plot would be to have the coefficients from model 1 above those from model 2. In any case, I would like to know how to control the order of coefficients from distinct models (not the order of the variables themselves). I tried sorting the tidy dataframe with order and converting the model column with factor. Neither approach works. Any advice would be most welcome.
You can change the order of the coefficients by reordering your tidy dataframe. One possible problem is that the legend order changes too, but that can be fixed by setting the legend breaks explicitly.
dwplot(arrange(two_models, desc(model))) +
scale_color_discrete(breaks=c("Model 1","Model 2"))

Data Munging Challenge. How do I join the correct coefficients to the correct observation in a summarized table

Before I start, a basic answer to this question can be found here:
Correctly binding coefficients to summarized table
This question is different in that I need to join the correct coefficients to the correct position in the summary table based on where a knot is placed. I use the I(pmax(0, variable - knot)) technique to place my splines. The end result is a table of the unique values of each variable, a summarized measure and the correct model statistics (see my final, still unfinished, table in the example code below).
library(tidyverse)
library(broom)
#pull in and gather data
mtcars1 <- as_tibble(mtcars)
mtcars1$cyl <- as.factor(mtcars$cyl)
#run model and produce model-summary table
model <- glm(mpg ~ cyl + hp + I(pmax(0, hp - 100)), data = mtcars1)
model_summary <- tidy(model)
#produce final summary table
summary_table <- mtcars1 %>%
  select(cyl, hp, wt) %>%
  gather(key = variable, level, -wt) %>%
  group_by(variable, level) %>%
  summarise("sum_wt" = sum(wt)) %>%
  mutate(term = paste0(variable, level)) %>%
  left_join(model_summary, by = c("term" = "term"))
The challenge is to take the I(pmax(0, hp - 100)) term in the model_summary table and correctly join its estimate, std.error, statistic and p.value to each hp observation in the summary_table that is <= 100, and to join the plain hp term's statistics to each hp observation in the summary_table that is > 100.
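For reference when deciding which rows each term belongs to, the short sketch below shows how the I(pmax(0, hp - 100)) hinge term behaves. It is illustrative only: it drops cyl from the model and uses lm() instead of glm() to keep things small. The hinge value is zero for hp at or below the knot and grows linearly above it, so the fitted slope is the hp coefficient up to the knot and the sum of both coefficients beyond it.

```r
# The hinge term is 0 at or below the knot and (hp - 100) above it
pmax(0, c(90, 100, 150) - 100)

# Simplified model (assumption: cyl dropped, lm() instead of glm())
fit <- lm(mpg ~ hp + I(pmax(0, hp - 100)), data = mtcars)
b <- coef(fit)   # order: intercept, hp, hinge term

slope_below <- unname(b[2])          # slope for hp <= 100
slope_above <- unname(b[2] + b[3])   # slope for hp > 100

# Check the slopes against predictions on either side of the knot
p <- predict(fit, newdata = data.frame(hp = c(60, 80, 120, 140)))
c(below = unname(p[2] - p[1]) / 20, above = unname(p[4] - p[3]) / 20)
```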
