Keep identifying features when using map() to "spread_predictions" in R? - r

I am trying to build a linear model and then make predictions on new data from that model. The following chunk of code takes a given set of data (data_1) and produces 20 models, because grouping by ID and plot gives 20 groups:
modelobject <- data_1 %>%
group_by(ID, plot) %>%
do(model = lm(air_temp ~ water_temp, data = .)) %>%
ungroup()
Now that the model is designed, I want to use the map() function to make predictions across a new set of data (data_2) for each of those models:
modelled_values <- map(modelobject$model, ~ spread_predictions(data = data_2, models = .x))
This works great, except for the fact that the subsequent object modelled_values doesn't have the identifying features of the original models (i.e. their given ID and plot) as can be seen in the following output for the Value column (it produces 11 columns, none of which are identifying features):
Value
List of length 20
A data.frame with 52606 rows and 11 columns
....
I have ended up having to assume that they are just in the order I produced them in and manually label each model object with the following style of code:
modelled_values[[1]]$ID <- "ID1"
modelled_values[[2]]$ID <- "ID1"
modelled_values[[3]]$ID <- "ID2"
modelled_values[[4]]$ID <- "ID2"
...
Is there any way I can carry the identifying features of the original models over to these predicted data?

What about something like this:
modelobject <- mtcars %>%
group_by(vs, am) %>%
do(model = lm(mpg ~ hp, data = .))
preds <- modelobject %>%
group_by(vs, am) %>%
rowwise %>%
summarise(preds = list(predict(model, newdata=mtcars)))
preds
# # A tibble: 4 x 3
# # Groups: vs, am [4]
# vs am preds
# <dbl> <dbl> <list>
# 1 0 0 <dbl [32]>
# 2 0 1 <dbl [32]>
# 3 1 0 <dbl [32]>
# 4 1 1 <dbl [32]>
In the code above, preds is now a tibble with a list column called preds, where each element is a vector of predictions from the model for the vs and am values in that row.
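If one long data frame is more convenient than a list column, the same idea can be taken one step further. The following is only a sketch under assumptions: mtcars stands in for both data_1 and data_2, and the names models, new_data and preds_long are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)

# One model per (vs, am) group, kept in a list column next to its identifiers
models <- mtcars %>%
  nest(data = -c(vs, am)) %>%
  mutate(model = map(data, ~ lm(mpg ~ hp, data = .x)))

# Stand-in for data_2 (assumption: any data frame with an hp column)
new_data <- mtcars

# Predict with every model, then unnest so that each predicted value
# sits in a row labelled with the vs/am of the model that produced it
preds_long <- models %>%
  mutate(pred = map(model, ~ unname(predict(.x, newdata = new_data)))) %>%
  select(vs, am, pred) %>%
  unnest(pred)
```

Because the identifiers travel in the same tibble as the models, nothing depends on the order in which the models were created.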

Related

Running Levene's test for each column of a df in R

I have a data frame containing scores of several sub-scales of the same test (columns: participant, session, group, total score, one column per sub-scale). I am trying to run assumption checks for a two-way mixed ANOVA for each sub-scale. For convenience, I would like to write one loop per assumption check, that gives me the output for all sub-scales. This worked well for checking outliers, running Box's M test and for generating the actual ANOVA output. However, I get an error when trying the same thing with Levene's test. See code and errors below:
subscales <- c("awareness", "clarity", "impulse", "goals", "nonacceptance",
"strategies") # these correspond to the column names in the df
for (scale in subscales) {
ders %>%
group_by(session) %>%
levene_test(scale ~ group) %>%
kable(caption = scale) %>% print()
}
Error in mutate(., data = map(.data$data, .f, ...)) :
Caused by error in model.frame.default():
! variable lengths differ (found for 'group')
How can I run Levene's test for all columns in my df without just repeating the same code over and over? I'm new to R, so maybe I'm approaching this in too Pythonic a way and should use something like lapply() instead?
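Create the formula with reformulate. Inside the loop, scale is a quoted string, so scale ~ group refers to a variable literally named scale rather than to the column it names. The formula therefore has to be built from the string, either with reformulate or by pasting the pieces together and converting the result with as.formula.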
for (scale in subscales) {
ders %>%
group_by(session) %>%
levene_test(reformulate('group', response = scale)) %>%
kable(caption = scale) %>% print()
}
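To see what reformulate produces, here is a small base R sketch. The column name is taken from the question; f1 and f2 are illustrative names.

```r
scale <- "awareness"

# reformulate() builds a formula from character strings
f1 <- reformulate("group", response = scale)

# The paste() route needs an explicit conversion with as.formula()
f2 <- as.formula(paste(scale, "~ group"))

print(f1)
print(f2)
```

Both calls give a formula with awareness on the left-hand side and group on the right, which is what levene_test() needs.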
This may also be done with across:
library(dplyr)
library(stringr)
library(tidyr)
library(rstatix)
data(mtcars)
mtcars %>%
mutate(carb = factor(carb)) %>%
group_by(cyl) %>%
summarise(across(c(mpg, disp),
~ levene_test(cur_data(),
reformulate('carb', response = cur_column())) %>%
rename_with(~ str_c(cur_column(), .x), everything()) )) %>%
unpack(where(is.tibble))
Output:
# A tibble: 3 × 9
cyl mpgdf1 mpgdf2 mpgstatistic mpgp dispdf1 dispdf2 dispstatistic dispp
<dbl> <int> <int> <dbl> <dbl> <int> <int> <dbl> <dbl>
1 4 1 9 0.975 0.349 1 9 1.32e- 1 7.24e- 1
2 6 2 4 2.52 0.196 2 4 7.44e+29 7.23e-60
3 8 3 10 1.60 0.251 3 10 1.18e+ 1 1.27e- 3

automate repeating models with different data forloop in R

I need to run a lot of replicates on the same model but cycle different data into it on each iteration.
e.g.
db1 <- mtcars
db2 <- mtcars
db3 <- mtcars
for(i in 1:db) {
# keep model structure but alternate the data
lm(mpg ~ wt, data = db[i])
}
I need to create a for-loop or a function that can run the model on db1, then swap in db2 and run the same model. I also need them to be stored as separate objects in my R environment e.g. lm1 (for db1) and lm2 (for db2)
Can someone please help me automate this?
thanks
The method I would use for something like this is to map a function over a list of dataframes. My preferred approach is a nested dataframe with one column for the dataframe name and one for the dataframe itself, to which we add a linear model column.
I have coded a version of this below using the map function, which takes our list of dataframes and applies lm to each entry.
library(tidyverse)
db1 <- mtcars
db2 <- mtcars
db3 <- mtcars
# Place dataframes in a list (note: do not use c() to combine dfs, as it flattens them into a list of columns)
a <- list(db1, db2 , db3)
# Construct our dataframe
df <- tibble(entry = 1:3, dataframes = a)
df %>%
# Map the lm function to all of the dataframes
mutate(lm = map(dataframes, ~lm(mpg~wt, data = .x)))
#> # A tibble: 3 x 3
#> entry dataframes lm
#> <int> <list> <list>
#> 1 1 <df[,11] [32 x 11]> <lm>
#> 2 2 <df[,11] [32 x 11]> <lm>
#> 3 3 <df[,11] [32 x 11]> <lm>
Created on 2021-04-06 by the reprex package (v2.0.0)
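A slightly more intuitive method with lists only could be as follows:
(Note that some information, e.g. the call to lm, is lost: b[i] <- stores only the first component of each fit, the coefficients, which is also why R issues the warning shown. Assigning with b[[i]] <- would keep the whole model object.)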
library(tidyverse)
db1 <- mtcars
db2 <- mtcars
db3 <- mtcars
a <- list(db1, db2 , db3)
b <- rep(list(), 3)
for(i in 1:3) {
b[i] <- lm(mpg~wt, data = a[[i]])
}
#> Warning in b[i] <- lm(mpg ~ wt, data = a[[i]]): number of items to replace is
#> not a multiple of replacement length
b
#> [[1]]
#> (Intercept) wt
#> 37.285126 -5.344472
#>
#> [[2]]
#> (Intercept) wt
#> 37.285126 -5.344472
#>
#> [[3]]
#> (Intercept) wt
#> 37.285126 -5.344472
Created on 2021-04-06 by the reprex package (v2.0.0)
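For comparison, here is a sketch of the same loop that keeps the complete lm objects by assigning with [[ ]] instead of [ ]. The names dfs and fits, and the lm1/lm2/lm3 labels, are illustrative choices rather than part of the answer above.

```r
# Three copies of mtcars stand in for db1, db2, db3
dfs <- list(db1 = mtcars, db2 = mtcars, db3 = mtcars)

# Pre-allocate a list with one slot per data frame
fits <- vector("list", length(dfs))

for (i in seq_along(dfs)) {
  # [[i]] stores the whole lm object, not just its first component
  fits[[i]] <- lm(mpg ~ wt, data = dfs[[i]])
}

# Label the fits lm1, lm2, lm3
names(fits) <- paste0("lm", seq_along(fits))
```

If separate objects in the environment are really required, list2env(fits, envir = globalenv()) would create lm1, lm2 and lm3, although keeping the models together in one list is usually easier to work with.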
Create a list of data frames rather than keeping the data frames as individual objects: it is harder to loop over db1, db2, db3 by name, whereas data frames stored in a list are easy to loop over. Here dfs is simply a list of data frames on which you can fit your models. I have created random datasets from mtcars; in your case the datasets may already be saved as db1, db2 or db3, so you can do either of these things:
a) dfs <- list(db1, db2, db3), then use this dfs with lapply like this: mymodels <- lapply(dfs, function(x)lm(mpg ~ wt, data=x))
b) dfs <- mget(ls(pattern='^db\\d+'), envir = globalenv()), where pattern describes the names of your data objects (in this case, names that start with db followed by a number), then use the same lapply call as above: mymodels <- lapply(dfs, function(x)lm(mpg ~ wt, data=x))
The example below uses randomly selected rows of mtcars to show one way of doing it.
# Creating a list of data frames from random subsets of mtcars
# replicate() draws n (= 3) samples, each picking 80% of the rows at random; seed 1 is set for reproducibility
set.seed(1)
n <- 3
prop = .8
dfs <- lapply(data.frame(replicate(n, sample(1:nrow(mtcars), prop*nrow(mtcars)))), function(x)mtcars[x,])
## replicate() repeats the sample() call n times and returns a matrix of row indices, one column per subset of mtcars
mymodels <- lapply(dfs, function(x)lm(mpg ~ wt, data=x)) #mymodels is your output
Output:
$X1
Call:
lm(formula = mpg ~ wt, data = x)
Coefficients:
(Intercept) wt
38.912167 -5.874795
$X2
Call:
lm(formula = mpg ~ wt, data = x)
Coefficients:
(Intercept) wt
37.740419 -5.519547
$X3
Call:
lm(formula = mpg ~ wt, data = x)
Coefficients:
(Intercept) wt
39.463332 -6.051852

fit an `lm` model for every level of a factor

I am trying to write a function that iterates (or uses purrr::map()) through every level of a factor, and fits an lm() model for the subset of the data where the factor is equal to that level.
To make a simple reproducible example with mtcars, just say that I'd like a different lm model for each value of mtcars$gear. I'll start by making it a factor, because my real problem involves iteration through a factor:
library(tidyverse)
mtcars <- mtcars %>%
mutate(factor_gear = factor(gear))
I'd like the function to fit every level of factor_gear. The levels are given by:
levels(mtcars$factor_gear)
i.e.
[1] "3" "4" "5"
So the output I would be looking for would be:
fit1 <- lm(mpg ~ cyl, data = mtcars %>% filter(factor_gear=="3"))
fit2 <- lm(mpg ~ cyl, data = mtcars %>% filter(factor_gear=="4"))
fit3 <- lm(mpg ~ cyl, data = mtcars %>% filter(factor_gear=="5"))
fits <- list(fit1, fit2, fit3)
I've made a start on the function, but wasn't able to get it to work.
I thought that a function should:
1. get every level of the factor into a vector
2. run an lm model for each level.
fit_each_level <- function(factor_variable) {
# trying to: 1. get every level of the factor into a vector
factor_levels <- levels(df_cars$factor_variable)
# trying to: 2. run an lm model for each level.
for i in factor_levels {
fit <- mtcars %>% filter(factor_variable==i [# every value of segment_levels]) %>%
lm(mpg ~ cyl, data = . )
}
}
fit_each_level(factor_gear)
If the function worked well, I'd ultimately be able to do it on another factor, e.g.:
mtcars <- mtcars %>%
mutate(factor_carb = factor(carb))
fit_each_level(factor_carb)
You can nest the dataframe and use map to apply lm for each factor_gear.
library(dplyr)
library(purrr) # provides map()
mtcars %>%
group_by(factor_gear) %>%
tidyr::nest() %>%
mutate(model = map(data, ~lm(mpg ~ cyl, data = .x)))
# factor_gear data model
# <fct> <list> <list>
#1 4 <tibble [12 × 11]> <lm>
#2 3 <tibble [15 × 11]> <lm>
#3 5 <tibble [5 × 11]> <lm>
In dplyr 1.0.0 and later you can use cur_data() to refer to the current group's data, which avoids the need for nest and map (note that cur_data() has been deprecated since dplyr 1.1.0 in favour of pick()).
mtcars %>%
group_by(factor_gear) %>%
summarise(model = list(lm(mpg ~ cyl, data = cur_data())))
Make sure you have dplyr 1.0.0 or later. Then you can use:
model_coefs <- function(formula, data) {
coefs <- lm(formula, data)$coefficients
data.frame(coef = names(coefs), value = coefs)
}
mtcars %>%
dplyr::mutate(factor_gear = factor(gear)) %>%
dplyr::nest_by(factor_gear) %>%
dplyr::summarise(model_coefs(mpg ~ cyl, data)) %>%
tidyr::pivot_wider(names_from = coef, values_from = value)
# A tibble: 3 x 3
# Groups: factor_gear [3]
factor_gear `(Intercept)` cyl
<fct> <dbl> <dbl>
1 3 29.8 -1.83
2 4 41.3 -3.59
3 5 40.6 -3.2
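Since the question asked for a reusable function, here is a minimal base R sketch of one. It assumes the factor is passed as a column name in a string and the data frame is passed explicitly; this signature is a design choice for the sketch, not the asker's original one, and cars is an illustrative name.

```r
# Fit mpg ~ cyl separately for every level of the named factor column.
# factor_name is a character string, e.g. "factor_gear".
fit_each_level <- function(data, factor_name) {
  # split() returns one data frame per level; drop = TRUE skips empty levels
  pieces <- split(data, data[[factor_name]], drop = TRUE)
  lapply(pieces, function(d) lm(mpg ~ cyl, data = d))
}

cars <- transform(mtcars, factor_gear = factor(gear))
fits <- fit_each_level(cars, "factor_gear")
names(fits)
```

The result is a list of lm objects named after the factor levels, so an individual fit can be retrieved with, for example, fits[["4"]].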

Executing a statistical test across multiple subsets using purrr map

I'm trying to use purrr map functionality to create a number of sub-groups from a dataframe so that I can run a statistical test on each sub-group. So using mtcars as a sample data set, I can determine the set of unique carb values from:
mtcars %>% {unique(.$carb)}
gives [1] 4 1 2 3 6 8
Similarly, the set of unique gear values:
mtcars %>% {unique(.$gear)}
gives [1] 4 3 5
I'd like to iterate through the unique combinations of carb and gear and use this as a way to subset values within mtcars, so that I can perform a statistical test on each subset (as indexed by gear and carb). So the test would be:
data_subset %>% kruskal.test(.$mpg, .$am, data = .)
I've tried to do this using map from purrr. Something along the lines of:
library(purrr)
mtcars %>%
{unique(.$carb)} %>%
map2(mtcars, ~filter(.y, am == .x))
For most combinations of carb/gear in mtcars, there is only one value of am. From my limited understanding of the help and error messages, you need multiple groups (am in your example) to run the test.
library(tidyverse)
# Step 1 - limit to testable data
count(mtcars, carb, gear, am) %>%
count(carb, gear) %>% # Count am possibilities w/in each carb/gear group
filter(n > 1) %>%
left_join(mtcars) -> mtcars_mult_am
# Step 2 - nest, map each group to test, unnest
mtcars_mult_am %>%
nest(data = -c(carb, gear)) %>%
mutate(kruskal_raw = map(data, ~ kruskal.test(.x$mpg, .x$am)),
kruskal = map(kruskal_raw, broom::tidy)) %>%
select(-data) %>%
unnest(kruskal)
# A tibble: 2 x 7
carb gear kruskal_raw statistic p.value parameter method
<dbl> <dbl> <list> <dbl> <dbl> <int> <chr>
1 2 4 <S3: htest> 0 1 1 Kruskal-Wallis rank sum test
2 4 4 <S3: htest> 2.67 0.102 1 Kruskal-Wallis rank sum test
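The filtering in step 1 can also be written by counting the distinct am values per group directly. This is only an alternative sketch of that step; testable and n_am are illustrative names.

```r
library(dplyr)

# Keep only carb/gear combinations in which am takes more than one value,
# since the test needs at least two groups to compare
testable <- mtcars %>%
  group_by(carb, gear) %>%
  summarise(n_am = n_distinct(am), .groups = "drop") %>%
  filter(n_am > 1)

testable
```

Joining testable back to mtcars by carb and gear gives the subset of rows on which the test can be run.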

Using dplyr() to retrieve model object created via group_by() and do()

I'm trying to use dplyr and the pipe operator (%>%) to retrieve model objects stored in a dataframe.
With example data
library(dplyr)
set.seed(256)
dat <-
data.frame(x = rnorm(100),
y = rnorm(100, 10),
spec = sample(c("1", "2"), 100, TRUE)) %>%
group_by(spec) %>%
do(lm = lm(y ~ x, data = .))
I can subset and retrieve an actual model object
> dat$lm[dat$spec == "1"][[1]]
Call:
lm(formula = y ~ x, data = .)
Coefficients:
(Intercept) x
9.8171 -0.2292
> dat$lm[dat$spec == "1"][[1]] %>% class()
[1] "lm"
But I think this is an inelegant way of retrieving the lm() model object contained therein, especially given that the rest of my code is structured the "dplyr way". I'd like to use dplyr but I can't figure out how. For example, using
dat %>% filter(spec == "1") %>% select(lm)
doesn't work as it returns
Source: local data frame [1 x 1]
Groups: <by row>
# A tibble: 1 x 1
lm
<list>
1 <S3: lm>
and
dat %>% filter(spec == "1") %>% .$lm
only gets me to the first object in list, e.g.,
> dat %>% filter(spec == "1") %>% .$lm
[[1]]
Call:
lm(formula = y ~ x, data = .)
Coefficients:
(Intercept) x
10.01495 -0.07438
I can't figure out a way to get to the actual model object in the dat with dplyr. Certainly, I could use broom and tidy() to condense everything
library(broom)
tidy(dat, lm)
but this still doesn't return the actual model object:
> tidy(dat, lm)
# A tibble: 4 x 6
# Groups: spec [2]
spec term estimate std.error statistic p.value
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 (Intercept) 10.0 0.120 83.3 1.91e-54
2 1 x - 0.0744 0.111 - 0.671 5.05e- 1
3 2 (Intercept) 9.86 0.131 75.0 1.42e-50
4 2 x - 0.0793 0.148 - 0.535 5.95e- 1
I can even use dplyr to summarise() the output from a do() call and retrieve the coefficients from the models, but this still doesn't give me the model object itself:
dat %>%
select(spec) %>%
bind_cols(dat %>%
summarize(lm_i = coefficients(lm)[[1]],
lm_s = coefficients(lm)[[2]]))
Is there a dplyr way to retrieve the actual model object from models created with do()?
do returns a list column, so to extract its individual elements, you need to use list subsetting. There are various ways to do that, but in the tidyverse, purrr::pluck is a nice option to extract a single [possibly deeply nested] element:
library(tidyverse)
dat %>% pluck('lm', 1)
#>
#> Call:
#> lm(formula = y ~ x, data = .)
#>
#> Coefficients:
#> (Intercept) x
#> 10.01495 -0.07438
It's mostly equivalent to [[ subsetting, i.e.
dat[['lm']][[1]]
To get what you have to work, you need to keep subsetting, as .$lm returns the list column, which in this case is a list of a model. .[[1]] (akin to the 1 above) extracts the model from the list:
dat %>% filter(spec == "1") %>% .$lm %>% .[[1]]
or a hybrid approach, if you like:
dat %>% filter(spec == "1") %>% pluck('lm', 1)
or use pull to extract the column with NSE semantics:
dat %>% filter(spec == "1") %>% pull(lm) %>% pluck(1)
All return the same thing.
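As a quick check that these routes agree, here is a self-contained sketch. It builds the list column with summarise() and list() instead of do(), which is an assumption made to keep the example short; m1 and m2 are illustrative names.

```r
library(dplyr)
library(purrr)

set.seed(256)
dat <- data.frame(x = rnorm(100),
                  y = rnorm(100, 10),
                  spec = sample(c("1", "2"), 100, TRUE)) %>%
  group_by(spec) %>%
  # list() wraps each group's model so that it fits in a single cell
  summarise(lm = list(lm(y ~ x)))

# pull() extracts the list column, pluck(1) takes the model out of the list
m1 <- dat %>% filter(spec == "1") %>% pull(lm) %>% pluck(1)

# Base R equivalent using [[ subsetting
m2 <- dat$lm[[which(dat$spec == "1")]]

class(m1)
```

Both m1 and m2 are the fitted lm object itself rather than a one-element list, so functions such as summary() or predict() can be applied to them directly.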
