2-Way Anova Loop & Export to CSV or Excel Table - r

I've figured out how to run a 2-way anova on several variables in my data frame, but not sure how to get this into a format that could be easily exported to a csv file or excel. Ideally, I'd like it to have this in a format where each of my several hundred dependent variables is in it's own row, with the pVaules and Fvalues
I've made an example using the titanic dataset. In this case I've set Sex & Embarked as my categorical variables, and would like the output for the effects of Sex Embarked and ~Interaction somehow saved to a file. I'm open to suggestions on how to output this -- just want to be able to easily identify what values are significant, ideally with each dependent variable on its own line.
library(titanic)
library(tidyverse)
df1 <-
titanic_train %>%
select(Sex, Embarked, (1:10)) %>%
select(!("Name" | "Ticket")) %>%
filter(Embarked != "") # deleting empty Embarked status
names(df1)
df1$Sex<- factor(df1$Sex)
df1$Embarked <-factor(df1$Embarked)
#store all formulae in a list
formulae <- lapply(colnames(df1)[3:ncol(df1)], function(x) as.formula(paste0(x, " ~ Sex * Embarked")))
#go through list and run aov()
results <- lapply(formulae, function(x) summary(aov(x, data = df1)))
names(results) <- format(formulae)
results

You can extract the relevant statistics from the summary or store the model in a list and use broom::tidy on it to get all the stats together in a dataframe. Use map functions to run it on list of models.
library(purrr)
library(broom)
results <- lapply(formulae, function(x) aov(x, data = df1))
names(results) <- format(formulae)
data <- map_df(results, tidy, .id = 'formulae')
data
# A tibble: 28 x 7
# formulae term df sumsq meansq statistic p.value
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 PassengerId ~… Sex 1 1.09e+5 1.09e+5 1.65 1.99e- 1
# 2 PassengerId ~… Embark… 2 5.50e+4 2.75e+4 0.416 6.60e- 1
# 3 PassengerId ~… Sex:Em… 2 7.73e+4 3.86e+4 0.584 5.58e- 1
# 4 PassengerId ~… Residu… 883 5.84e+7 6.61e+4 NA NA
# 5 Survived ~ Se… Sex 1 6.16e+1 6.16e+1 376. 4.44e-70
# 6 Survived ~ Se… Embark… 2 3.32e+0 1.66e+0 10.1 4.39e- 5
# 7 Survived ~ Se… Sex:Em… 2 4.85e-1 2.43e-1 1.48 2.28e- 1
# 8 Survived ~ Se… Residu… 883 1.45e+2 1.64e-1 NA NA
# 9 Pclass ~ Sex … Sex 1 1.01e+1 1.01e+1 16.2 6.12e- 5
#10 Pclass ~ Sex … Embark… 2 5.83e+1 2.91e+1 46.8 4.74e-20
# … with 18 more rows
Write data to csv.
write_csv(data, 'data.csv')

Related

R loop over linear regression

I have looked over the forum but couldn't find what I am looking for.
I want to run a simple linear regression a couple of times. Each time using a different column as my independent variable, the dependent variable stays the same. After running it I want to be able to extract the R squared from each of the regressions. My thought process was to use a simple for loop. However, I cannot make it work.
Assume I work with the following data:
num value person1 person2 person3
0 1 229 29 81 0
1 2 203 17 75 0
2 3 244 62 0 55
and that I want to run the regression on the value using three variables: person1, person2 and person3. Note that this is a minimal working example but I hope to generalize the idea.
And so my initial attempt was to:
column <- names(df)[-2]
for(i in 3:5){
temp <- df[,c("value", column[i])]
lm.test <- lm(value ~ ., data = temp)
i + 1
}
However, when I run summary(lm.test) I only get a summary of the last regression, i.e. lm(value ~ person3) which I think makes sense but when trying to rewrite it as: lm.test[i] <- lm(value ~ ., data = temp) I get the following error:
debug at #3: temp <- df[,c("value", column[i])]
suggesting that there's something wrong with line 3?
If possible I'd like to be able to capture the summary for each regression but what I am really after is the R squared for each one of the regressions.
You can create formula in a loop and then run the lm. For instance, if I want to run regression on mtcars for regressing mpg on each of cyl, wt, hp, I can use the following:
vars <- c("cyl", "wt", "hp")
lm_results <- lapply(vars, function(col){
lm_formula <- as.formula(paste0("mpg ~ ", col))
lm(lm_formula, data = mtcars)
})
You can then again iterate over lm_results to get the r.squared:
lapply(lm_results, function(x) summary(x)$r.squared)
Here’s an approach using broom::glance() and purrr::map_dfr() to collect model summary stats into a tidy tibble:
library(broom)
library(purrr)
lm.test <- map_dfr(
set_names(names(df)[-2]),
~ glance(lm(
as.formula(paste("value ~", .x)),
data = df
)),
.id = "predictor"
)
Result:
# A tibble: 4 x 13
predictor r.squared adj.r.squared sigma statistic p.value df logLik AIC
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 num 0.131 -0.739 27.4 0.150 0.765 1 -12.5 31.1
2 person1 0.836 0.672 11.9 5.10 0.265 1 -10.0 26.1
3 person2 0.542 0.0831 19.9 1.18 0.474 1 -11.6 29.2
4 person3 0.607 0.215 18.4 1.55 0.431 1 -11.3 28.7
# ... with 4 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
# nobs <int>
NB, you can capture model coefficients with a similar approach using broom::tidy() instead of glance().

Mapping broom::tidy to nested list of {fixest} models and keep name of list element

I want to apply broom::tidy() to models nested in a fixest_multi object and extract the names of each list level as data frame columns. Here's an example of what I mean.
library(fixest)
library(tidyverse)
library(broom)
multiple_est <- feols(c(Ozone, Solar.R) ~ Wind + Temp, airquality, fsplit = ~Month)
This command estimates two models for each dep. var. (Ozone and Solar.R) for a subset of each Month plus the full sample. Here's how the resulting object looks like:
> names(multiple_est)
[1] "Full sample" "5" "6" "7" "8" "9"
> names(multiple_est$`Full sample`)
[1] "Ozone" "Solar.R"
I now want to tidy each model object, but keep the information of the Month / Dep.var. combination as columns in the tidied data frame. My desired output would look something like this:
I can run map_dfr from the tidyr package, giving me this result:
> map_dfr(multiple_est, tidy, .id ="Month") %>% head(9)
# A tibble: 9 x 6
Month term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Full sample (Intercept) -71.0 23.6 -3.01 3.20e- 3
2 Full sample Wind -3.06 0.663 -4.61 1.08e- 5
3 Full sample Temp 1.84 0.250 7.36 3.15e-11
4 5 (Intercept) -76.4 82.0 -0.931 3.53e- 1
5 5 Wind 2.21 2.31 0.958 3.40e- 1
6 5 Temp 3.07 0.878 3.50 6.15e- 4
7 6 (Intercept) -70.6 46.8 -1.51 1.45e- 1
8 6 Wind -1.34 1.13 -1.18 2.50e- 1
9 6 Temp 1.64 0.609 2.70 1.29e- 2
But this tidies only the first model of each Month, the model with the Ozone outcome.
My desired output would look something like this:
Month outcome term estimate more columns from tidy
Full sample Ozone (Intercept) -71.0
Full sample Ozone Wind -3.06
Full sample Ozone Temp 1.84
Full sample Solar.R (Intercept) some value
Full sample Solar.R Wind some value
Full sample Solar.R Temp some value
... rows repeated for each month 5, 6, 7, 8, 9
How can I apply tidy to all models and add another column that indicates the outcome of the model (which is stored in the name of the model object)?
So, fixest_mult has a pretty strange setup as I delved deeper. As you noticed, mapping across it or using apply just accesses part of the data frames. In fact, it isn't just the data frames for "Ozone", but actually just the data frames for the first 6 data frames (those for c("Full sample", "5", "6").
If you convert to a list, it access the data attribute, which is a sequential list of all 12 data frames, but dropping the relevant names you're looking for. So, as a workaround, could use pmap() and the names (found in the attributes of the object) to tidy() and then use mutate() for your desired columns.
library(fixest)
library(tidyverse)
library(broom)
multiple_est <- feols(c(Ozone, Solar.R) ~ Wind + Temp, airquality, fsplit = ~Month)
nms <- attr(multiple_est, "meta")$all_names
pmap_dfr(
list(
data = as.list(multiple_est),
month = rep(nms$sample, each = length(nms$lhs)),
outcome = rep(nms$lhs, length(nms$sample))
),
~ tidy(..1) %>%
mutate(
Month = ..2,
outcome = ..3,
.before = 1
)
)
#> # A tibble: 36 × 7
#> Month outcome term estimate std.error statistic p.value
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Full sample Ozone (Intercept) -71.0 23.6 -3.01 3.20e- 3
#> 2 Full sample Ozone Wind -3.06 0.663 -4.61 1.08e- 5
#> 3 Full sample Ozone Temp 1.84 0.250 7.36 3.15e-11
#> 4 Full sample Solar.R (Intercept) -76.4 82.0 -0.931 3.53e- 1
#> 5 Full sample Solar.R Wind 2.21 2.31 0.958 3.40e- 1
#> 6 Full sample Solar.R Temp 3.07 0.878 3.50 6.15e- 4
#> 7 5 Ozone (Intercept) -70.6 46.8 -1.51 1.45e- 1
#> 8 5 Ozone Wind -1.34 1.13 -1.18 2.50e- 1
#> 9 5 Ozone Temp 1.64 0.609 2.70 1.29e- 2
#> 10 5 Solar.R (Intercept) -284. 262. -1.08 2.89e- 1
#> # … with 26 more rows

Issue using "pred_yes" column as the estimate argument to roc_curve()

When I run the below data it shows an incorrect roc_curve.
Prep
The below code should be run-able for anyone using r-studio. The dataframe contains characteristics of different employees regarding: performance ratings, sales figures, and whether
or not they were promoted.
I am attempting to create a decision tree model that uses all other variables to predict if an employee was promoted. The primary purpose of this question is to find out what I am doing incorrectly when tring to use the roc_curve() function.
library(tidyverse)
library(tidymodels)
library(peopleanalyticsdata)
url <- "http://peopleanalytics-regression-book.org/data/salespeople.csv"
salespeople <- read.csv(url)
salespeople <- salespeople %>% mutate(promoted = factor(ifelse(promoted == 1, "yes", "no")))
creating testing/training data
Using my own homemade train_test() function just for kicks!
train_test <- function(data, train.size=0.7, na.rm=FALSE) {
if(na.rm == TRUE) {
dt <- sample(x=nrow(data), size=nrow(data)* train.size)
data_nm <- na.omit(data)
train<-data_nm[dt,]
test<- data_nm[-dt,]
set <- list(train, test)
names(set) <- c("train", "test")
return(set)
} else {
dt <- sample(x=nrow(data), size=nrow(data)* train.size)
train<-data[dt,]
test<- data[-dt,]
set <- list(train, test)
names(set) <- c("train", "test")
return(set)
}
}
tt_list <- train_test(salespeople)
sales_train <- tt_list$train
sales_test <- tt_list$test
'''
creating decision tree model structure/final model/prediction dataframe
'''
tree <- decision_tree() %>%
set_engine("rpart") %>%
set_mode("classification")
model <- tree %>% fit(promoted ~ ., data = sales_train)
predictions <- predict(model,
sales_test,
type = "prob") %>%
bind_cols(sales_test)
'''
Calculate & Plot the ROC curve
When I use the .pred_yes column as the estimate column, it calculates an ROC curve that is the inverse of what I want. It seems that it has identified .pred_no as the "real" estimate column
'''
roc <- roc_curve(predictions,
estimate = .pred_yes,
truth = promoted)
autoplot(roc)
'''
Thoughts
Seems like the issue goes away when I supply pred_no as the estimate column to roc_curve()
FYI: this is my first stack overflow post, if you have any suggestions to make this post more clear/better formatted please let me know!
In factor(c("yes", "no")), "no" is the first level, the level that most modeling packages assume is the one of interest. In tidymodels, you can adjust the level of interest via the event_level argument, as documented here:
library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
url <- "http://peopleanalytics-regression-book.org/data/salespeople.csv"
salespeople <- read_csv(url) %>%
mutate(promoted = factor(ifelse(promoted == 1, "yes", "no")))
#> Rows: 351 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (4): promoted, sales, customer_rate, performance
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sales_split <- initial_split(salespeople)
sales_train <- training(sales_split)
sales_test <- testing(sales_split)
tree <- decision_tree() %>%
set_engine("rpart") %>%
set_mode("classification")
tree_fit <- tree %>% fit(promoted ~ ., data = sales_train)
sales_preds <- augment(tree_fit, sales_test)
sales_preds
#> # A tibble: 88 × 7
#> promoted sales customer_rate performance .pred_class .pred_no .pred_yes
#> <fct> <dbl> <dbl> <dbl> <fct> <dbl> <dbl>
#> 1 no 364 4.89 1 no 0.973 0.0267
#> 2 no 342 3.74 3 no 0.973 0.0267
#> 3 yes 716 3.16 3 yes 0 1
#> 4 no 450 3.21 3 no 0.973 0.0267
#> 5 no 372 3.87 3 no 0.973 0.0267
#> 6 no 535 4.47 2 no 0.973 0.0267
#> 7 yes 736 3.94 4 yes 0 1
#> 8 no 330 2.54 2 no 0.973 0.0267
#> 9 no 478 3.48 2 no 0.973 0.0267
#> 10 yes 728 2.66 3 yes 0 1
#> # … with 78 more rows
sales_preds %>%
roc_curve(promoted, .pred_yes, event_level = "second") %>%
autoplot()
Created on 2021-09-08 by the reprex package (v2.0.1)

How to make calculations separated by factors within a data frame using R?

I have a data table with the results of an experiment that evaluated 2 factors: Light and Day_light at 2 different temperatures (Temperature).
I performed a 2-way ANOVA for each Temperature using the rstatix package.
The result of the 2-way ANOVA is presented as a data table, with a column called SSn. I would like to divide each SSn value by the sum of all SSn values for each Temperature. For that, I used an approach similar to the one that the rstatix package uses, but I was not successful. Below, I present the code I used and a brief graphic explanation of what I want to accomplish.
library(rstatix)
# Data frame
Temperature <- factor(c(rep("cold", times = 8),
rep("hot", times = 8)),
levels = c("cold", "hot"))
Light <- factor(rep(c(rep("blue", times = 4),
rep("yellow", times = 4)),
times = 2),
levels = c("blue", "yellow"))
Day_light <- factor(rep(c(rep("Day", times = 2),
rep("Night", times = 2)),
times = 4),
levels = c("Day", "Night"))
Result <- c(90.40, 85.20, 21.70, 25.30,
75.12, 77.36, 6.11, 10.8
85.14, 88.96, 30.21, 35.15)
Data <- data.frame(Temperature, Light, Day_light, Result)
# ANOVA
ANOVA <- Data %>%
group_by(Temperature) %>%
anova_test(Result ~ Light * Day_light,
detailed = TRUE)
ANOVA
# Calculations within the ANOVA data frame (not running)
Calculations <- ANOVA %>%
group_by(Temperature) %>%
ANOVA$SSn/sum(ANOVA$SSn)*100
Calculations
> ANOVA
# A tibble: 6 x 10
Temperature Effect SSn SSd DFn DFd F p `p<.05` ges
* <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 cold Light 354. 33.5 1 4 42.2 0.003 "*" 0.914
2 cold Day_light 8723. 33.5 1 4 1041. 0.0000055 "*" 0.996
3 cold Light:Day_light 6.07 33.5 1 4 0.725 0.442 "" 0.153
4 hot Light 773. 23.1 1 4 134. 0.000318 "*" 0.971
5 hot Day_light 5014. 23.1 1 4 869. 0.00000788 "*" 0.995
6 hot Light:Day_light 37.0 23.1 1 4 6.41 0.065 "" 0.616
I have already solved it partially, but I still don't know how to separate the calculation by Temperature
ANOVA$Calculations <- ANOVA$SSn/sum(ANOVA$SSn)*100
Graphical representation of my question
then...
I tend to stick to data.table personally, because it has some nice benefits, and there is quite a learning curve involved.
The traditional plyr way is also shown:
library(data.table)
ANOVA <- as.data.table(ANOVA)
ANOVA[, Calculations := SSn / sum(SSn) , by=Temperature ]
ANOVA
## and the plyr way:
ANOVA %>% group_by( Temperature ) %>%
mutate( Calculations = SSn / sum(SSn) )

extract random effects from MCMCglmm

I have a function which runs an MCMCglmm a bunch of times.
shuffles <- 1:10
names(shuffles) <- paste0("shuffle_", shuffles)
library(MCMCglmm)
library(dplyr)
library(tibble)
library(purrr)
ddd <- purrr::map(shuffles,
~ df %>%
mutate(Trait = sample(Trait)) %>%
MCMCglmm(fixed = Trait ~ 1,
random = ~ Year,
data = .,
family = "categorical",
verbose = FALSE)) %>%
purrr::map( ~ tibble::as_tibble(summary(.x)$solutions, rownames = "model_term")) %>%
dplyr::bind_rows(., .id = 'shuffle')
ddd
This section extracts fixed effects only.
(summary(.x)$Solutions, rownames = "model_term")
But note that I am not running a model without any fixed effects and so the output is empty.
How can I extract random effects using the same or similar code?
I guess I can change 'solutions' to something else to extract random effects from a model I have run without any fixed effects.
Note that this is an extension to a previous question (with example df) here - lapply instead of for loop for randomised hypothesis testing r
A relatively easy way to do this is with broom.mixed::tidy. It's not clear whether you mean you want to extract the summary for the top-level random effects parameters (i.e. the variances of the random effects), or for the estimates of the group-level effects.
library(broom.mixed)
tidy(m, effects="ran_pars")
##
## effect group term estimate std.error
## 1 ran_pars Year var__(Intercept) 0.00212 629.
## 2 ran_pars Residual var__Observation 40465. 24211.
If you want the group-level effects you need effects="ran_vals", but you have to re-run your model with pr=TRUE (or do it that way in the first place) in order to have these effects saved in the model object:
m <- MCMCglmm(Trait ~ ID, random = ~ Year, data = df, family = "categorical", pr=TRUE)
tidy(m, effects="ran_vals")
effect group level term estimate std.error
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 ran_vals Year 1992 (Intercept) 2.65e-8 4.90
2 ran_vals Year 1993 (Intercept) 1.14e-8 6.23
3 ran_vals Year 1994 (Intercept) 1.28e-8 4.88
4 ran_vals Year 1995 (Intercept) -6.83e-9 5.31
5 ran_vals Year 1996 (Intercept) -1.36e-8 5.07
6 ran_vals Year 1997 (Intercept) 1.31e-8 5.24
7 ran_vals Year 1998 (Intercept) -2.80e-9 5.25
8 ran_vals Year 1999 (Intercept) 3.52e-8 5.68

Resources