I'm trying to use dplyr and the pipe operator (%>%) to retrieve model objects stored in a dataframe.
With example data
library(dplyr)
set.seed(256)
dat <-
data.frame(x = rnorm(100),
y = rnorm(100, 10),
spec = sample(c("1", "2"), 100, TRUE)) %>%
group_by(spec) %>%
do(lm = lm(y ~ x, data = .))
I can subset and retrieve an actual model object
> dat$lm[dat$spec == "1"][[1]]
Call:
lm(formula = y ~ x, data = .)
Coefficients:
(Intercept) x
9.8171 -0.2292
> dat$lm[dat$spec == "1"][[1]] %>% class()
[1] "lm
But I think this is an inelegant way of retrieving the lm() model object contained therein, especially given that the rest of my code is structured the "dplyr way". I'd like to use dplyr but I can't figure out how. For example, using
dat %>% filter(spec == "1") %>% select(lm)
doesn't work as it returns
Source: local data frame [1 x 1]
Groups: <by row>
# A tibble: 1 x 1
lm
<list>
1 <S3: lm>
and
dat %>% filter(spec == "1") %>% .$lm
only gets me to the first object in list, e.g.,
> dat %>% filter(spec == "1") %>% .$lm
[[1]]
Call:
lm(formula = y ~ x, data = .)
Coefficients:
(Intercept) x
10.01495 -0.07438
I can't figure out a way to get to the actual model object in the dat with dplyr. Certainly, I could use broom and tidy() to condense everything
library(broom)
tidy(dat, lm)
but this still doesn't return the actual model object:
> tidy(dat, lm)
# A tibble: 4 x 6
# Groups: spec [2]
spec term estimate std.error statistic p.value
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 (Intercept) 10.0 0.120 83.3 1.91e-54
2 1 x - 0.0744 0.111 - 0.671 5.05e- 1
3 2 (Intercept) 9.86 0.131 75.0 1.42e-50
4 2 x - 0.0793 0.148 - 0.535 5.95e- 1
I can even use dplyr to summarise() the output from a do() call and retrieve the coefficients from the models, but this still doesn't give me the model object itself:
dat %>%
select(spec) %>%
bind_cols(dat %>%
summarize(lm_i = coefficients(lm)[[1]],
lm_s = coefficients(lm)[[2]]))
Is there a dplyr way to retrieve the actual model object from models created with do()?
do returns a list column, so to extract its individual elements, you need to use list subsetting. There are various ways to do that, but in the tidyverse, purrr::pluck is a nice option to extract a single [possibly deeply nested] element:
library(tidyverse)
dat %>% pluck('lm', 1)
#>
#> Call:
#> lm(formula = y ~ x, data = .)
#>
#> Coefficients:
#> (Intercept) x
#> 10.01495 -0.07438
It's mostly equivalent to [[ subsetting, i.e.
dat[['lm']][[1]]
To get what you have to work, you need to keep subsetting, as .$lm returns the list column, which in this case is a list of a model. .[[1]] (akin to the 1 above) extracts the model from the list:
dat %>% filter(spec == "1") %>% .$lm %>% .[[1]]
or a hybrid approach, if you like:
dat %>% filter(spec == "1") %>% pluck('lm', 1)
or use pull to extract the column with NSE semantics:
dat %>% filter(spec == "1") %>% pull(lm) %>% pluck(1)
All return the same thing.
Related
I have a data frame containing scores of several sub-scales of the same test (columns: participant, session, group, total score, one column per sub-scale). I am trying to run assumption checks for a two-way mixed ANOVA for each sub-scale. For convenience, I would like to write one loop per assumption check, that gives me the output for all sub-scales. This worked well for checking outliers, running Box's M test and for generating the actual ANOVA output. However, I get an error when trying the same thing with Levene's test. See code and errors below:
subscales <- c("awareness", "clarity", "impulse", "goals", "nonacceptance",
"strategies") # these correspond to the column names in the df
for (scale in subscales) {
ders %>%
group_by(session) %>%
levene_test(scale ~ group) %>%
kable(caption = scale) %>% print()
}
Error in mutate(., data = map(.data$data, .f, ...)) :
Caused by error in model.frame.default():
! variable lengths differ (found for 'group')
How can I run Levene's test for all columns in my df without just repeating the same code over and over? I'm new to R, so maybe I'm trying in a too pythonist kind of way and should use something like lapply() instead?
Create the formula with reformulate as the scale will be quoted string and thus, it needs the formula to be constructed either with reformulate or paste
for (scale in subscales) {
ders %>%
group_by(session) %>%
levene_test(reformulate('group', response = scale)) %>%
kable(caption = scale) %>% print()
}
This maybe also done with across
library(dplyr)
library(stringr)
library(tidyr)
library(rstatix)
data(mtcars)
mtcars %>%
mutate(carb = factor(carb)) %>%
group_by(cyl) %>%
summarise(across(c(mpg, disp),
~ levene_test(cur_data(),
reformulate('carb', response = cur_column())) %>%
rename_with(~ str_c(cur_column(), .x), everything()) )) %>%
unpack(where(is.tibble))
-output
# A tibble: 3 × 9
cyl mpgdf1 mpgdf2 mpgstatistic mpgp dispdf1 dispdf2 dispstatistic dispp
<dbl> <int> <int> <dbl> <dbl> <int> <int> <dbl> <dbl>
1 4 1 9 0.975 0.349 1 9 1.32e- 1 7.24e- 1
2 6 2 4 2.52 0.196 2 4 7.44e+29 7.23e-60
3 8 3 10 1.60 0.251 3 10 1.18e+ 1 1.27e- 3
I need to make a constrained model by group in R. I tried the group_by and do() functions to estimate the unconstrained lm, but when I try the same for a constrained model with ConsReg it doesn´t work.
This worked for the unconstrained lm:
df_grouped <- df %>%
group_by(type, Region)
grouped_lm <- df_grouped %>%
do(tidy(lm(y ~ x, data =.)))
For the constrained model I tried this:
grouped_lm_constrained <- df_grouped %>%
do(ConsReg(formula = y ~ x, family = 'gaussian', optimizer = 'mcmc', LOWER = 0, UPPER = 1, data =.))
but gives me this error:
"Error in `do()`:
! Results 1, 2, 3, 4, 5, ... must be data frames, not ConsReg."
Does anyone know what's happening?
The problem you are facing stems from the broom::tidy function, which has no implementation for ConsReg models/objects. What you could do is write your custom function for extraction of the desired content from a ConsReg model/object. To know what the model object has in its belly you can i.e. generate just one model (one group) and call str(model) on it as well as str(summary(model)) to see what base R can do for you in terms of structuring the data. In the example below I extracted a not selection of what could be importante model content. You might have to adapt this according to your usecase and needs.
I really like the aproach of nested lists in tibbles and running models on those. Anyhow you can run the do() approach or even split the data.frame into a list where each item is a group and work mapping functions for example on those.
library(ConsReg)
library(dplyr)
library(purrr)
library(tidyr)
# Dummy data
df <- data.frame(g = sort(rep(c("A", "B") , "10")),
x = rep(1:10, 2),
y = c(1:10, seq(from = 1, to = 100, by = 10)))
# custom function which takes a model as input and parses the formula, coefficients plus aditional data and MAPE as a data.frame
myfun <- function(x){
cbind(fromula = x$formula %>% deparse,
as.data.frame(summary(x)$coefficients) %>% tibble::rownames_to_column() %>% dplyr::rename(Term = 1),
MAPE = x$metrics$MAPE)
}
# group the df for nesting in the next step
dplyr::group_by(df, g) %>%
# nest the columns of interest into a list where each item (aka group) contains the mentioned variables
tidyr::nest(data = c("x", "y")) %>%
# run run map functions on data to generate model and the custom extraction function
dplyr::mutate(crmod = purrr::map(data, ~ ConsReg(y ~ x, family = 'gaussian', optimizer = 'mcmc', LOWER = 0, UPPER = 1, data = .x)),
stats = purrr::map(crmod, ~ myfun(.x))) %>%
# unnest the stats column from list items do df row(s)
tidyr::unnest(stats)
# Groups: g [2]
g data crmod fromula Term Estimate StdErr t.value p.value MAPE
<chr> <list> <list> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A <tibble [10 x 2]> <ConsReg> y ~ x (Intercept) 1.12e-15 7.79e-16 1.44 1.87e- 1 1.87e-16
2 A <tibble [10 x 2]> <ConsReg> y ~ x x 1 e+ 0 1.03e- 1 9.67 1.09e- 5 1.87e-16
3 B <tibble [10 x 2]> <ConsReg> y ~ x (Intercept) 9.84e- 1 3.03e- 2 32.5 8.79e-10 8.58e- 1
4 B <tibble [10 x 2]> <ConsReg> y ~ x x 9.98e- 1 7.83e- 3 128. 1.59e-14 8.58e- 1
I have a data frame and have done an ANOVA between the data. After the ANOVA I want to use one of the resulting columns to do a calculation and create a new column with the mutate() function. However, an error appears indicating that this operation cannot be done on an anova class object:
Error: `x` must be a vector, not a <anova_test/data.frame/rstatix_test> object.
Can someone help me perform calculations (F + 1) with the F column of the ANOVA result?
library(dplyr)
library(rstatix)
Temperature <- factor(c(rep("cold", times = 4),
rep("hot", times = 4)),
levels = c("cold", "hot"))
Light <- factor(rep(c(rep("blue", times = 2),
rep("yellow", times = 2)),
times = 2),
levels = c("blue", "yellow"))
Result <- c(90.40, 85.20, 21.70, 25.30,
75.12, 77.36, 6.11, 10.8)
Data <- data.frame(Temperature, Light, Result)
NewColumn <- Data %>%
anova_test(formula = Result ~ Temperature*Light) %>%
mutate(New= `F` + 1) #<-------- Not working
As mentioned by JKupzig in the comments, this is a known issue in dplyr as documented here: https://github.com/tidyverse/dplyr/issues/5286.
The issue is caused by anova_test() creating an output data frame with classes anova_test, data.frame and rstatix_test, in that order, while mutate() from dplyr seems to get hung up if the last element in the class vector is not data.frame. You can verify the classes of the output of the anova as follows:
Data %>% anova_test(formula = Result ~ Temperature*Light) %>% class()
[1] "anova_test" "data.frame" "rstatix_test"
As a workaround, you can add as_tibble() to your dplyr pipe after anova_test(). This resets the classes to tbl_df, tbl, and data.frame, in that order.
Data %>% anova_test(formula = Result ~ Temperature*Light) %>% as_tibble() %>% class()
[1] "tbl_df" "tbl" "data.frame"
Data %>%
anova_test(formula = Result ~ Temperature*Light) %>%
as_tibble() %>%
mutate(New= `F` + 1)
# A tibble: 3 x 8
Effect DFn DFd F p `p<.05` ges New
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
1 Temperature 1 4 42.2 0.003 "*" 0.914 43.2
2 Light 1 4 1041. 0.0000055 "*" 0.996 1042.
3 Temperature:Light 1 4 0.725 0.442 "" 0.153 1.72
Note, that this action removes the classes anova_test and rstatix_test. If these classes are important down the line, use a different workaround with set_class() from the magrittr package (magrittr is a dependency of dplyr, so no need to install it separately).
Data %>%
anova_test(formula = Result ~ Temperature*Light) %>%
magrittr::set_class(c("anova_test", "rstatix_test", "data.frame")) %>%
class()
[1] "anova_test" "rstatix_test" "data.frame"
Data %>%
anova_test(formula = Result ~ Temperature*Light) %>%
magrittr::set_class(c("anova_test", "rstatix_test", "data.frame")) %>%
mutate(New = `F` + 1)
ANOVA Table (type II tests)
Effect DFn DFd F p p<.05 ges New
1 Temperature 1 4 42.250 3.00e-03 * 0.914 43.250
2 Light 1 4 1041.366 5.50e-06 * 0.996 1042.366
3 Temperature:Light 1 4 0.725 4.42e-01 0.153 1.725
I am trying to build a linear model, and then make predictions with new data based on that linear model. The following chunk of code takes a given set of data (data1), and produces 20 models based on the fact that when I group by ID and plot, there are 20 groups:
modelobject <- data_1 %>%
group_by(ID, plot) %>%
do(model = lm(air_temp ~ water_temp, data = .)) %>%
ungroup()
Now that the model is designed, I want to use the map() function to make predictions across a new set of data (data_2) for each of those models:
modelled_values <- map(modelobject$model, ~ spread_predictions(data = data_2, models = .x))
This works great, except for the fact that the subsequent object modelled_values doesn't have the identifying features of the original models (i.e. their given ID and plot) as can be seen in the following output for the Value column (it produces 11 columns, none of which are identifying features):
Value
List of length 20
A data.frame with 52606 rows and 11 columns
....
I have ended up having to assume that they are just in the order I produced them in and manually label each model object with the following style of code:
modelled_values[[1]]$ID <- "ID1"
modelled_values[[2]]$ID <- "ID1"
modelled_values[[3]]$ID <- "ID2"
modelled_values[[4]]$ID <- "ID2"
...
Is there any way I can carry the identifying features of the original models over to these predicted data?
What about something like this:
modelobject <- mtcars %>%
group_by(vs, am) %>%
do(model = lm(mpg ~ hp, data = .))
preds <- modelobject %>%
group_by(vs, am) %>%
rowwise %>%
summarise(preds = list(predict(model, newdata=mtcars)))
preds
# # A tibble: 4 x 3
# # Groups: vs, am [4]
# vs am preds
# <dbl> <dbl> <list>
# 1 0 0 <dbl [32]>
# 2 0 1 <dbl [32]>
# 3 1 0 <dbl [32]>
# 4 1 1 <dbl [32]>
In the code above, preds is now a tibble with a column called preds where each element is as vector of predictions from the model for the relevant vs and am values in the row.
I am trying to write a function that iterates (or uses purrr::map()) through every level of a factor, and fits an lm() model for the subset of the data where the factor is equal to that level.
To make a simple reproducable example with mtcars, just say that I'd like a different lm model for each value of mtcars$gear. I'll start by making it a factor, because my real problem involves iteration through a factor:
library(tidyverse)
mtcars <- mtcars %>%
mutate(factor_gear = factor(gear))
I'd like the function to fit every level of factor_gear. The levels are given by:
levels(mtcars$factor_gear)
i.e.
[1] "3" "4" "5"
So the output I would be looking for would be:
fit1 <- lm(mpg ~ cyl, data = mtcars %>% filter(factor_gear=="3"))
fit2 <- lm(mpg ~ cyl, data = mtcars %>% filter(factor_gear=="4"))
fit3 <- lm(mpg ~ cyl, data = mtcars %>% filter(factor_gear=="5"))
fits <- list(fit1, fit2, fit3)
I've made a start on the function, but wasn't able to get it to work.
I thought that a function should:
get every level of of the factor into a vector
run an lm model for each level.
fit_each_level <- function(factor_variable) {
# trying to: 1. get every level of of the factor into a vector
factor_levels <- levels(df_cars$factor_variable)
# trying to: 2. run an lm model for each level.
for i in factor_levels {
fit <- mtcars %>% filter(factor_variable==i [# every value of segment_levels]) %>%
lm(mpg ~ cyl, data = . )
}
}
fit_each_level(factor_gear)
If the function worked well, I'd ultimately be able to do do it on another factor, eg:
mtcars <- mtcars %>%
mutate(factor_carb = factor(carb))
fit_each_level(factor_carb)
You can nest the dataframe and use map to apply lm for each factor_gear.
library(dplyr)
mtcars %>%
group_by(factor_gear) %>%
tidyr::nest() %>%
mutate(model = map(data, ~lm(mpg ~ cyl, data = .x)))
# factor_gear data model
# <fct> <list> <list>
#1 4 <tibble [12 × 11]> <lm>
#2 3 <tibble [15 × 11]> <lm>
#3 5 <tibble [5 × 11]> <lm>
In the new dplyr you can use cur_data to refer to current data in group which avoids the need of nest and map.
mtcars %>%
group_by(factor_gear) %>%
summarise(model = list(lm(mpg ~ cyl, data = cur_data())))
Make sure you have the latest version of dplyr (1.0.0). Then you can use:
model_coefs <- function(formula, data) {
coefs <- lm(formula, data)$coefficients
data.frame(coef = names(coefs), value = coefs)
}
mtcars %>%
dplyr::mutate(factor_gear = factor(gear)) %>%
dplyr::nest_by(factor_gear) %>%
dplyr::summarise(model_coefs(mpg ~ cyl, data)) %>%
tidyr::pivot_wider(names_from = coef, values_from = value)
# A tibble: 3 x 3
# Groups: factor_gear [3]
factor_gear `(Intercept)` cyl
<fct> <dbl> <dbl>
1 3 29.8 -1.83
2 4 41.3 -3.59
3 5 40.6 -3.2