I'm computing the model outputs for a linear regression for a dependent variable with 45 different id values. How can I use tidy (dplyr, apply, etc.) code to accomplish this?
I have a dataset with three variables, data = c(id, distance, actPct), such that id is in 1:45, -10 <= distance <= 10, and 0 <= actPct <= 1.
I need to run a regression, model0n, on each value of id, such that the output of each model0n lands in a new tibble/df. I have completed it for a single regression:
model01 <- data %>%
  filter(id == 1) %>%
  filter(distance < 1) %>%
  filter(distance > -4)
model01 <- lm(data = model01, actPct~distance)
Example Data
set.seed(42)
id <- as.tibble(sample(1:45,100,replace = T))
distance <- as.tibble(sample(-4:4,100,replace = T))
actPct <- as.tibble(runif(100, min=0, max=1))
data01 <- bind_cols(id=id, distance=distance, actPct=actPct)
attr(data01, "col.names") <- c("id", "distance", "actPct")
I expect a new tibble or dataframe that has model01:model45 so I can put all of the regression outputs into a single table.
You can use group_by, nest and mutate with map from the tidyverse to accomplish this:
data01 %>%
  group_by(id) %>%
  nest() %>%
  mutate(models = map(data, ~ lm(actPct ~ distance, data = .x)))
# A tibble: 41 x 3
# id data models
# <int> <list> <list>
# 1 42 <tibble [3 x 2]> <S3: lm>
# 2 43 <tibble [4 x 2]> <S3: lm>
# 3 13 <tibble [2 x 2]> <S3: lm>
# 4 38 <tibble [4 x 2]> <S3: lm>
# 5 29 <tibble [2 x 2]> <S3: lm>
# 6 24 <tibble [5 x 2]> <S3: lm>
# 7 34 <tibble [5 x 2]> <S3: lm>
# 8 7 <tibble [3 x 2]> <S3: lm>
# 9 30 <tibble [2 x 2]> <S3: lm>
# 10 32 <tibble [2 x 2]> <S3: lm>
# ... with 31 more rows
See also the chapter about many models in R for Data Science: https://r4ds.had.co.nz/many-models.html
Data
set.seed(42)
id <- sample(1:45, 100, replace = T)
distance <- sample(-4:4, 100, replace = T)
actPct <- runif(100, min = 0, max = 1)
data01 <- tibble(id = id, distance = distance, actPct = actPct)
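To get from the list column of models to the single table the question asks for, something like the following might work. This is only a sketch: the distance filter is copied from the question, while the names coef_table, n_obs, intercept and slope are made up for illustration. Groups with one observation (or a single distinct distance) get an NA slope.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(42)
data01 <- tibble(
  id       = sample(1:45, 100, replace = TRUE),
  distance = sample(-4:4, 100, replace = TRUE),
  actPct   = runif(100, min = 0, max = 1)
)

coef_table <- data01 %>%
  # same window as in the question: -4 < distance < 1
  filter(distance > -4, distance < 1) %>%
  group_by(id) %>%
  nest() %>%
  ungroup() %>%
  mutate(
    model     = map(data, ~ lm(actPct ~ distance, data = .x)),
    n_obs     = map_int(data, nrow),
    # coef() returns NA for a slope that cannot be estimated
    intercept = map_dbl(model, ~ coef(.x)[["(Intercept)"]]),
    slope     = map_dbl(model, ~ coef(.x)[["distance"]])
  ) %>%
  select(id, n_obs, intercept, slope)

coef_table
```

Each row of coef_table is one id, so the 45 separate model0n objects are not needed.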
Related
Based on the response to this question that I posed I have something like the following:
library(tidyverse)  # attaches dplyr, tidyr and purrr
library(broom)
dataset <- tibble(
y1=rnorm(n=100),
y2=rnorm(n=100),
x1=rnorm(n=100),
x2=rnorm(n=100),)
outcomes <- dataset %>%
select(y1,y2) %>% colnames
covars <- dataset %>%
select(x1,x2) %>% colnames
paramlist <- expand_grid(outcomes, covars)
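paramlist %>%
  rowwise() %>%
  # reformulate(termlabels, response): covariate on the right-hand side, outcome as the response
  mutate(mod = list(lm(reformulate(covars, response = outcomes), data = dataset)),
         res = list(broom::tidy(mod)),
         predicted = list(predict(mod)),
         data = list(cbind(dataset, predicted)))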
# A tibble: 4 x 6
# Rowwise:
#> outcomes covars mod res predicted data
#> <chr> <chr> <list> <list> <list> <list>
#> 1 y1 x1 <lm> <tibble [2 x 5]> <dbl [100]> <df [100 x 5]>
#> 2 y1 x2 <lm> <tibble [2 x 5]> <dbl [100]> <df [100 x 5]>
#> 3 y2 x1 <lm> <tibble [2 x 5]> <dbl [100]> <df [100 x 5]>
#> 4 y2 x2 <lm> <tibble [2 x 5]> <dbl [100]> <df [100 x 5]>
What I would like to do now is - for each combination of outcomes and covars - I'd like to calculate the mean or sd of the predicted value in data conditional on some value of x1. For example, x1 might be a treatment variable, and I'd like the adjusted mean of the outcome for those with x1=0. The tricky part seems to be that the outcome and conditioning variable differ across rows.
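One possible way to do this, sketched under assumptions: x1 is taken to be a binary 0/1 treatment here (the example data above draws x1 from rnorm, so x1 == 0 would never match), and the column names mean_pred_x1_0 and sd_pred_x1_0 are invented for illustration. Because the frame is rowwise, predicted refers to one row's vector of fitted values, and it can be subset by the condition on dataset$x1 regardless of which outcome and covariate that row uses.

```r
library(dplyr)
library(tidyr)

set.seed(1)
dataset <- tibble(
  y1 = rnorm(100),
  y2 = rnorm(100),
  x1 = rbinom(100, 1, 0.5),  # assumption: binary treatment indicator
  x2 = rnorm(100)
)

paramlist <- expand_grid(outcomes = c("y1", "y2"), covars = c("x1", "x2"))

results <- paramlist %>%
  rowwise() %>%
  mutate(
    mod            = list(lm(reformulate(covars, response = outcomes), data = dataset)),
    predicted      = list(predict(mod)),
    # summaries of the fitted values among observations with x1 == 0
    mean_pred_x1_0 = mean(predicted[dataset$x1 == 0]),
    sd_pred_x1_0   = sd(predicted[dataset$x1 == 0])
  ) %>%
  ungroup()

results %>% select(outcomes, covars, mean_pred_x1_0, sd_pred_x1_0)
```

If the condition should differ per row, it could be stored as another column of paramlist and evaluated inside the same mutate.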
I have a data frame like this:
df <- tibble(
i = rep(1:10, times = 5),
t = rep(1:5, each = 10)
) %>%
mutate(y = rnorm(50))
I want to apply a function that takes the data frame for each t as its argument:
f <- function(df){
return(lm(y ~ +1, data = df))
}
When I apply purrr::map to a nested data frame with the pipe operator, I get an error.
# does not work
df_nested <- df %>%
  nest(data = c(t, y)) %>%
  rename(data_col = data)
df_nested %>%
  purrr::map(.x = .$data_col, .f = f)
On the other hand, when I do not use the pipe operator, I get the desired result.
# Ok
purrr::map(.x = df_nested$data_col, .f = f)
To my understanding, both snippets should return the same result. What is wrong with the code that uses the pipe operator?
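The pipe already passes the previous value (df_nested) as the first argument to map. Because .x and .f are supplied by name, the call effectively becomes map(df_nested, .x = .$data_col, .f = f), so df_nested ends up in ... and is forwarded to f as an extra argument that f does not accept. You may wrap the call in {} to stop that from happening.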
library(tidyverse)
df_nested %>%
{purrr::map(.x = .$data_col, .f = f)}
Another way is to keep everything in the data frame and build the models with mutate:
df %>%
  nest(data_col = c(t, y)) %>%
  mutate(model = map(data_col, f))
# i data_col model
# <int> <list> <list>
# 1 1 <tibble [5 × 2]> <lm>
# 2 2 <tibble [5 × 2]> <lm>
# 3 3 <tibble [5 × 2]> <lm>
# 4 4 <tibble [5 × 2]> <lm>
# 5 5 <tibble [5 × 2]> <lm>
# 6 6 <tibble [5 × 2]> <lm>
# 7 7 <tibble [5 × 2]> <lm>
# 8 8 <tibble [5 × 2]> <lm>
# 9 9 <tibble [5 × 2]> <lm>
#10 10 <tibble [5 × 2]> <lm>
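A third option, shown here only as a sketch, is to pull the list column out first so that the list itself (and not the whole data frame) is what gets piped into map. The data and f are re-created so the snippet runs on its own; the name models is just for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(1)
df <- tibble(
  i = rep(1:10, times = 5),
  t = rep(1:5, each = 10)
) %>%
  mutate(y = rnorm(50))

# intercept-only model, as in the question
f <- function(d) lm(y ~ 1, data = d)

df_nested <- df %>%
  nest(data_col = c(t, y))

# pull() returns the list column, which then becomes map()'s first argument (.x)
models <- df_nested %>%
  pull(data_col) %>%
  map(f)

length(models)
```

This gives a plain list of lm objects, the same thing as purrr::map(df_nested$data_col, f).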
I'm having trouble with what seems like a simple problem. I have a data frame with some locations and each location has a value associated with it. I nested the data.frame by the locations and then bootstrapped the values using purrr (see below).
library(tidyverse)
library(modelr)
library(purrr)
locations <- c("grave","pinkham","lower pinkham", "meadow", "dodge", "young")
values <- rnorm(n = 100, mean = 3, sd = .5)
df <- data.frame(values, locations = sample(locations, 100, replace = TRUE))
df.boot <- df %>%
  nest(-locations) %>%
  mutate(boot = map(data, ~bootstrap(., n = 100, id = "values")))
Now I'm trying to get the median from each bootstrap in the final list df.boot$boot, but can't seem to figure it out. I've tried map(boot, median), but the more I dig in, the less sense that makes. The vector I want in the boot list is idx, from which I can get the median value and then store it (pretty much what the boot function does, but iterating over the unique values of a categorical variable). Any help would be much appreciated. I might just be going at this the wrong way...
If we need to extract the median of each bootstrap sample, we can pull the strap column and map over its resamples:
library(dplyr)
library(purrr)
library(modelr)
out <- df %>%
  group_by(locations) %>%
  nest %>%
  mutate(boot = map(data, ~ bootstrap(.x, n = 100, id = 'values') %>%
                      pull('strap') %>%
                      map_dbl(~ as_tibble(.x) %>%
                                pull('values') %>%
                                median)))
out
# A tibble: 6 x 3
# Groups: locations [6]
# locations data boot
# <fct> <list> <list>
#1 pinkham <tibble [12 × 1]> <dbl [100]>
#2 lower pinkham <tibble [17 × 1]> <dbl [100]>
#3 meadow <tibble [16 × 1]> <dbl [100]>
#4 dodge <tibble [22 × 1]> <dbl [100]>
#5 grave <tibble [21 × 1]> <dbl [100]>
#6 young <tibble [12 × 1]> <dbl [100]>
data
df <- data.frame(values, locations = sample(locations, 100, replace = TRUE))
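Once each location has its vector of 100 bootstrap medians, they can be collapsed into one row of summary numbers per location. The following is only a sketch: the 2.5% and 97.5% percentile bounds and the names boot_summary, median_est, lower and upper are assumptions added for illustration, not something the question asked for.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(modelr)

set.seed(1)
locations <- c("grave", "pinkham", "lower pinkham", "meadow", "dodge", "young")
df <- data.frame(values = rnorm(n = 100, mean = 3, sd = 0.5),
                 locations = sample(locations, 100, replace = TRUE))

boot_summary <- df %>%
  group_by(locations) %>%
  nest() %>%
  ungroup() %>%
  mutate(
    # 100 resamples per location; take the median of `values` in each one
    boot = map(data, ~ bootstrap(.x, n = 100)$strap %>%
                 map_dbl(~ median(as.data.frame(.x)$values))),
    median_est = map_dbl(boot, median),
    lower      = map_dbl(boot, ~ quantile(.x, 0.025, names = FALSE)),
    upper      = map_dbl(boot, ~ quantile(.x, 0.975, names = FALSE))
  )

boot_summary %>% select(locations, median_est, lower, upper)
```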
I have data of the following form
date data
<chr> <list>
1 2012-01-05 <tibble [796 x 5]>
2 2012-01-12 <tibble [831 x 5]>
3 2012-01-19 <tibble [820 x 5]>
... ...
I would like to use something analogous to map() to calculate the mean and standard deviation.
I can currently calculate each one separately with the following, but is it possible to calculate both at the same time?
mutate(stats = map(data, ~ sd(.$metric)))
mutate(stats = map(data, ~ mean(.$metric)))
Another alternative is to write a function that works like summary (which returns the quartiles and the mean) but calculates the mean and sd instead. Then I could use that new function in map as follows:
mutate(stats = map(data, ~ new_function(.$metric)))
Is there a better alternative?
A simple option to add multiple columns is to just make another list column of the desired summary statistics and unnest it:
library(tidyverse)
set.seed(47)
df <- tibble(date = seq(as.Date('1970-01-01'), by = 1, length.out = 4),
             data = map(date, ~tibble(metric = rnorm(10))))
df
#> # A tibble: 4 x 2
#> date data
#> <date> <list>
#> 1 1970-01-01 <tibble [10 × 1]>
#> 2 1970-01-02 <tibble [10 × 1]>
#> 3 1970-01-03 <tibble [10 × 1]>
#> 4 1970-01-04 <tibble [10 × 1]>
df %>%
  mutate(stats = map(data, ~data.frame(mean = mean(.x$metric),
                                       sd = sd(.x$metric)))) %>%
  unnest(stats)
#> # A tibble: 4 x 4
#> date data mean sd
#> <date> <list> <dbl> <dbl>
#> 1 1970-01-01 <tibble [10 × 1]> -0.106 0.992
#> 2 1970-01-02 <tibble [10 × 1]> -0.102 0.875
#> 3 1970-01-03 <tibble [10 × 1]> -0.833 0.979
#> 4 1970-01-04 <tibble [10 × 1]> 0.184 0.671
A more programmatic approach (which may scale better) is to iterate within the anonymous function over a list of functions. lst will automatically name them, so the results will be named, and map_dfc will cbind them into a data frame:
df %>%
  mutate(stats = map(data,
                     ~map_dfc(lst(mean, sd),
                              function(.fun) .fun(.x$metric)))) %>%
  unnest(stats)
purrr has a purpose-built function for iterating over functions/parameters like this: invoke_map (note that the invoke_* family is deprecated as of purrr 1.0.0, so the approaches above are the safer choice in new code). If you want the function or parameters to be recycled, they have to be in a length-1 list. Since parameters should already be collected in a list, here it has to be a nested list.
df %>%
  mutate(stats = map(data,
                     ~invoke_map_dfc(lst(mean, sd),
                                     list(list(.x$metric))))) %>%
  unnest(stats)
All approaches return the same thing.
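For completeness, the same table can probably also be produced by calling summarise on each nested tibble, which keeps the statistics as ordinary named arguments. This is a sketch using the same simulated data as above; the result name stats_tbl is made up for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(47)
df <- tibble(date = seq(as.Date("1970-01-01"), by = 1, length.out = 4),
             data = map(date, ~ tibble(metric = rnorm(10))))

stats_tbl <- df %>%
  # summarise() returns a one-row tibble per nested data frame
  mutate(stats = map(data, ~ summarise(.x,
                                       mean = mean(metric),
                                       sd   = sd(metric)))) %>%
  unnest(stats)

stats_tbl
```

Adding another statistic is then just another argument to summarise, e.g. median = median(metric).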
I'm trying to extract model info from models stored in a list column.
Using mtcars to illustrate my problem:
mtcars %>%
nest(-cyl) %>%
mutate(model= map(data, ~lm(mpg~wt, data=.))) %>%
mutate(aic=AIC(model))
What I get is this error message:
Error in mutate_impl(.data, dots) :
Evaluation error: no applicable method for 'logLik' applied to an object of class "list".
But when I do it this way, it works.
mtcars %>%
group_by(cyl) %>%
do(model= lm(mpg~wt, data=.)) %>%
mutate(aic=AIC(model))
Can anyone explain why the second way works and the first does not? I could not figure it out. In both cases, the list column 'model' contains the model info, but there must be some difference. Thanks a lot.
Let's compare the differences between these two approaches. We can run your entire code except for the final AIC call and save the results to a and b.
a <- mtcars %>%
nest(-cyl) %>%
mutate(model= map(data, ~lm(mpg~wt, data=.)))
b <- mtcars %>%
group_by(cyl) %>%
do(model= lm(mpg~wt, data=.))
Now we can print the results in the console.
a
# A tibble: 3 x 3
cyl data model
<dbl> <list> <list>
1 6 <tibble [7 x 10]> <S3: lm>
2 4 <tibble [11 x 10]> <S3: lm>
3 8 <tibble [14 x 10]> <S3: lm>
b
Source: local data frame [3 x 2]
Groups: <by row>
# A tibble: 3 x 2
cyl model
* <dbl> <list>
1 4 <S3: lm>
2 6 <S3: lm>
3 8 <S3: lm>
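Now we can see that dataframe b is grouped by row, while dataframe a is not. This is the key: in a, mutate hands AIC the whole model list column at once, whereas in b each row's model is passed to AIC on its own.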
To extract AIC in dataframe a, we can use the rowwise function to group dataframe by each row.
mtcars %>%
nest(-cyl) %>%
mutate(model= map(data, ~lm(mpg~wt, data=.))) %>%
rowwise() %>%
mutate(aic=AIC(model))
Source: local data frame [3 x 4]
Groups: <by row>
# A tibble: 3 x 4
cyl data model aic
<dbl> <list> <list> <dbl>
1 6 <tibble [7 x 10]> <S3: lm> 25.65036
2 4 <tibble [11 x 10]> <S3: lm> 61.48974
3 8 <tibble [14 x 10]> <S3: lm> 63.31555
Or we can use the map_dbl function because we know each AIC is numeric.
mtcars %>%
nest(-cyl) %>%
mutate(model= map(data, ~lm(mpg~wt, data=.))) %>%
mutate(aic = map_dbl(model, AIC))
# A tibble: 3 x 4
cyl data model aic
<dbl> <list> <list> <dbl>
1 6 <tibble [7 x 10]> <S3: lm> 25.65036
2 4 <tibble [11 x 10]> <S3: lm> 61.48974
3 8 <tibble [14 x 10]> <S3: lm> 63.31555
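The map_dbl pattern extends to any per-model number. Below is a sketch written with the newer nest(data = -cyl) spelling; the r_squared column and the name fits are additions for illustration and not part of the original question.

```r
library(dplyr)
library(tidyr)
library(purrr)

fits <- mtcars %>%
  nest(data = -cyl) %>%
  mutate(
    model     = map(data, ~ lm(mpg ~ wt, data = .x)),
    # one number per model, so map_dbl gives a plain numeric column
    aic       = map_dbl(model, AIC),
    r_squared = map_dbl(model, ~ summary(.x)$r.squared)
  )

fits %>% select(cyl, aic, r_squared)
```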