Nesting categorical variable, bootstrap, then extract median in R

I'm having trouble with what seems like a simple problem. I have a data frame with some locations, and each location has a value associated with it. I nested the data frame by location and then bootstrapped the values using purrr (see below).
library(tidyverse)
library(modelr)
library(purrr)
locations <- c("grave","pinkham","lower pinkham", "meadow", "dodge", "young")
values <- rnorm(n = 100, mean = 3, sd = .5)
df <- data.frame(values, locations = sample(locations, 100, replace = TRUE))
df.boot <- df %>%
nest(-locations) %>%
mutate(boot = map(data,~bootstrap(.,n=100, id = "values")))
Now I'm trying to get the median from each bootstrap in the final list df.boot$boot, but I can't figure out how. I've tried map(boot, median), but the more I dig in, the less that makes sense. The vector I want in the boot list is idx, from which I can get the median value and then store it (pretty much what the boot function does, but iterating over the unique values of a categorical variable). Any help would be much appreciated. I might just be going about this the wrong way...

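If we need to extract the median of each bootstrap resample, we can pull the strap column, convert each resample to a tibble, and take the median of its values column:
library(dplyr)
library(tidyr) # for nest()
library(purrr)
library(modelr)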
out <- df %>%
group_by(locations) %>%
nest %>%
mutate(boot = map(data, ~ bootstrap(.x, n = 100, id = 'values') %>%
pull('strap') %>%
map_dbl(~ as_tibble(.x) %>%
pull('values') %>%
median)))
out
# A tibble: 6 x 3
# Groups: locations [6]
# locations data boot
# <fct> <list> <list>
#1 pinkham <tibble [12 × 1]> <dbl [100]>
#2 lower pinkham <tibble [17 × 1]> <dbl [100]>
#3 meadow <tibble [16 × 1]> <dbl [100]>
#4 dodge <tibble [22 × 1]> <dbl [100]>
#5 grave <tibble [21 × 1]> <dbl [100]>
#6 young <tibble [12 × 1]> <dbl [100]>
data
df <- data.frame(values, locations = sample(locations, 100, replace = TRUE))
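Once each location holds a vector of 100 bootstrap medians, a natural next step is to summarise that vector. The following is a minimal base-R sketch of the same idea without modelr. The helper `boot_median()`, the `boot_summary` column names, and the choice of a 2.5%/97.5% percentile interval are assumptions made for illustration, not part of the original answer.

```r
set.seed(123)
locations <- c("grave", "pinkham", "lower pinkham", "meadow", "dodge", "young")
df <- data.frame(
  values    = rnorm(n = 100, mean = 3, sd = 0.5),
  locations = sample(locations, 100, replace = TRUE)
)

# Hypothetical helper: draw n resamples of x with replacement and
# return the median of each resample.
boot_median <- function(x, n = 100) {
  replicate(n, median(x[sample.int(length(x), replace = TRUE)]))
}

# One vector of bootstrap medians per location.
boot_list <- lapply(split(df$values, df$locations), boot_median)

# Collapse each vector into a point estimate and a percentile interval.
boot_summary <- data.frame(
  locations   = names(boot_list),
  median_boot = vapply(boot_list, median, numeric(1)),
  lower       = vapply(boot_list, quantile, numeric(1), probs = 0.025),
  upper       = vapply(boot_list, quantile, numeric(1), probs = 0.975),
  row.names   = NULL
)
boot_summary
```

Indexing with `sample.int()` rather than calling `sample(x)` directly avoids the base-R surprise where a length-one numeric `x` is treated as `1:x`.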

Related

Extract residuals from models fit in purrr

I grouped my data and fit a model to each group and I would like to have the residuals for each group. I can see the residuals for each model with RStudio's viewer, but I cannot figure out how to extract them. Extracting one set of residuals can be done like diamond_mods[[3]][[1]][["residuals"]], but how do I use purrr to extract the set from every group (along with broom to end up with a nice tibble)?
Below is how far I've gotten:
library(tidyverse)
library(purrr)
library(broom)
fit_mod <- function(df) {
lm(price ~ poly(carat, 2, raw = TRUE), data = df)
}
diamond_mods <- diamonds %>%
group_by(cut) %>%
nest() %>%
mutate(
model = map(data, fit_mod),
tidied = map(model, tidy)
#resid = map_dbl(model, "residuals") #this was my best try, it doesn't work
) %>%
unnest(tidied)
You were close, but you should use map() instead of map_dbl(): each model's residuals are a vector with one value per row, so the result has to be a list column rather than a single number per group.
diamond_mods <- diamonds %>%
group_by(cut) %>%
nest() %>%
mutate(
model = map(data, fit_mod),
tidied = map(model, tidy),
resid = map(model, residuals)
)
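The difference between the two mappers is easier to see on a small case. This sketch uses the built-in mtcars data rather than diamonds (an assumption made to keep it quick to run) and contrasts a per-model vector with a per-model scalar.

```r
library(purrr)

# Fit one model per cylinder group of the built-in mtcars data.
models <- map(split(mtcars, mtcars$cyl), function(d) lm(mpg ~ wt, data = d))

# residuals() returns one value per row, so the result must be a list.
resid_list <- map(models, residuals)

# R-squared is a single number per model, so map_dbl() is appropriate.
r_squared <- map_dbl(models, function(m) summary(m)$r.squared)

lengths(resid_list)  # number of residuals in each group
r_squared
```

Calling `map_dbl(models, residuals)` here would fail for the same reason as in the question: each result has more than one element.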
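With the devel version of dplyr available at the time of writing, we can do this with condense() after grouping by 'cut'. Note that condense() was experimental and does not appear in released versions of dplyr, so the code below only runs against that development snapshot: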
library(dplyr)
library(ggplot2)
library(broom)
diamonds %>%
group_by(cut) %>%
condense(model = fit_mod(cur_data()),
tidied = tidy(model),
resid = model[["residuals"]])
# A tibble: 5 x 4
# Rowwise: cut
# cut model tidied resid
# <ord> <list> <list> <list>
#1 Fair <lm> <tibble [3 × 5]> <dbl [1,610]>
#2 Good <lm> <tibble [3 × 5]> <dbl [4,906]>
#3 Very Good <lm> <tibble [3 × 5]> <dbl [12,082]>
#4 Premium <lm> <tibble [3 × 5]> <dbl [13,791]>
#5 Ideal <lm> <tibble [3 × 5]> <dbl [21,551]>
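For readers on a released dplyr (1.0.0 or later), a rowwise workflow built on nest_by() gives a similar result. This is a sketch of one possible equivalent, not the original answer's method; it reuses the fit_mod() definition from the question.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data
library(broom)

fit_mod <- function(df) {
  lm(price ~ poly(carat, 2, raw = TRUE), data = df)
}

diamond_mods <- diamonds %>%
  nest_by(cut) %>%                        # one row per cut, rowwise
  mutate(model = list(fit_mod(data))) %>% # `data` is that row's tibble
  mutate(
    tidied = list(tidy(model)),
    resid  = list(residuals(model))
  )
diamond_mods
```

Because the result is rowwise, `data` and `model` refer to the single list element for the current row, which is why each new value is wrapped in `list()`.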

Get Mutate Error When Applying Purrr::Map on Grouped Data

Hi, I am trying to apply a very simple function using purrr::map, but I keep getting this error:
Error in mutate_impl(.data, dots) :
Evaluation error: unused argument (.x[[i]]).
The code is below:
data = data.frame(name = c('A', 'B', 'C'), metric = c(0.29, 0.39,0.89))
get_sample_size = function(metric, threshold = 0.01){
sample_size = ceiling((1.96^2)*(metric*(1-metric))/(threshold^2))
return(data.frame(sample_size))
}
data %>% group_by(name) %>% tidyr::nest() %>%
dplyr::mutate(result = purrr::map( .x = data, .f = get_sample_size, metric = metric, threshold = 0.01 ))
You don't need nest. The metric argument of the get_sample_size function should be a numeric vector, but after nest the data column is a list of data frames, which cannot be passed as the metric argument.
I think you can use summarize and map to apply your function to the metric column.
library(tidyverse)
data %>%
group_by(name) %>%
summarize(result = purrr::map(.x = metric,
.f = get_sample_size,
threshold = 0.01))
# # A tibble: 3 x 2
# name result
# <fct> <list>
# 1 A <data.frame [1 x 1]>
# 2 B <data.frame [1 x 1]>
# 3 C <data.frame [1 x 1]>
When you pass metric in the ... part of map, it's not clear that that is a column in the nested data frame. But once you nest the data like you've done, metric isn't a column in data, it's a column in the nested frame...also called "data." (This is a good example of why you want more specific variable names btw.)
If you're mapping over the data column, you can use $metric to point to that column, either in writing out a function, as I've done here (such as df$metric), or in formula notation (such as .$metric).
As @www said, you don't need nested data frames in this case. But for a more complicated case, you might need nested data frames to work with, such as for building models, so it's good to know how to reference exactly the data you want.
library(tidyverse)
data %>%
group_by(name) %>%
tidyr::nest() %>%
mutate(result = map(data, function(df) {
get_sample_size(metric = df$metric, threshold = 0.01)
}))
#> # A tibble: 3 x 3
#> name data result
#> <fct> <list> <list>
#> 1 A <tibble [1 × 1]> <data.frame [1 × 1]>
#> 2 B <tibble [1 × 1]> <data.frame [1 × 1]>
#> 3 C <tibble [1 × 1]> <data.frame [1 × 1]>
Created on 2019-01-16 by the reprex package (v0.2.1)
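Since the formula inside get_sample_size() uses only vectorised arithmetic, a further simplification is possible when a plain numeric column is acceptable. The sketch below assumes a hypothetical variant, `get_sample_size_vec()`, that returns a numeric vector instead of a one-column data frame; neither grouping nor purrr is needed in that case.

```r
data <- data.frame(name = c("A", "B", "C"), metric = c(0.29, 0.39, 0.89))

# Hypothetical variant of get_sample_size() that returns a plain numeric
# vector instead of a one-column data frame.
get_sample_size_vec <- function(metric, threshold = 0.01) {
  ceiling((1.96^2) * (metric * (1 - metric)) / (threshold^2))
}

# The arithmetic is vectorised, so the whole column can be passed at once.
data$sample_size <- get_sample_size_vec(data$metric, threshold = 0.01)
data
```

The nested approach above remains the better fit when the function genuinely needs a whole data frame per group.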

Use apply() to iterate linear regression models through multiple dependent variables

I'm computing the model outputs for a linear regression for a dependent variable with 45 different id values. How can I use tidy (dplyr, apply, etc.) code to accomplish this?
I have a dataset with three variables, data = c(id, distance, actPct), such that id == 1:45; -10 <= distance <= 10; 0 <= actPct <= 1.
I need to run a regression, model0n, on each value of id, such that model0n has its output in a new tibble/df. I have completed it for a single regression:
model01 <- data %>%
filter(id == 1) %>%
filter(distance < 1) %>%
filter(distance > -4)
model01 <- lm(data = model01, actPct~distance)
Example Data
set.seed(42)
id <- as.tibble(sample(1:45,100,replace = T))
distance <- as.tibble(sample(-4:4,100,replace = T))
actPct <- as.tibble(runif(100, min=0, max=1))
data01 <- bind_cols(id=id, distance=distance, actPct=actPct)
attr(data01, "col.names") <- c("id", "distance", "actPct")
I expect a new tibble or dataframe that has model01:model45 so I can put all of the regression outputs into a single table.
You can use group_by, nest and mutate with map from the tidyverse to accomplish this:
data01 %>%
group_by(id) %>%
nest() %>%
mutate(models = map(data, ~ lm(actPct ~ distance, data = .x)))
# A tibble: 41 x 3
# id data models
# <int> <list> <list>
# 1 42 <tibble [3 x 2]> <S3: lm>
# 2 43 <tibble [4 x 2]> <S3: lm>
# 3 13 <tibble [2 x 2]> <S3: lm>
# 4 38 <tibble [4 x 2]> <S3: lm>
# 5 29 <tibble [2 x 2]> <S3: lm>
# 6 24 <tibble [5 x 2]> <S3: lm>
# 7 34 <tibble [5 x 2]> <S3: lm>
# 8 7 <tibble [3 x 2]> <S3: lm>
# 9 30 <tibble [2 x 2]> <S3: lm>
# 10 32 <tibble [2 x 2]> <S3: lm>
# ... with 31 more rows
See also the chapter on many models in R for Data Science: https://r4ds.had.co.nz/many-models.html
Data
set.seed(42)
id <- sample(1:45, 100, replace = T)
distance <- sample(-4:4, 100, replace = T)
actPct <- runif(100, min = 0, max = 1)
data01 <- tibble(id = id, distance = distance, actPct = actPct)
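The question asks for all regression outputs in a single table. One way to get there from the nested models is to tidy each fit with broom and unnest the result. This is a sketch that assumes broom is installed; `model_table` and `coefs` are names introduced here for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

set.seed(42)
data01 <- tibble(
  id       = sample(1:45, 100, replace = TRUE),
  distance = sample(-4:4, 100, replace = TRUE),
  actPct   = runif(100, min = 0, max = 1)
)

model_table <- data01 %>%
  group_by(id) %>%
  nest() %>%
  mutate(
    models = map(data, ~ lm(actPct ~ distance, data = .x)),
    coefs  = map(models, tidy)   # coefficient table for each model
  ) %>%
  select(id, coefs) %>%
  unnest(coefs) %>%
  ungroup()
model_table
```

Groups with very few observations cannot support a slope estimate, so expect NA or NaN entries (and possibly warnings) for some ids in this simulated data.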

Use purrr to map to 2 functions

I have data of the following form
date data
<chr> <list>
1 2012-01-05 <tibble [796 x 5]>
2 2012-01-12 <tibble [831 x 5]>
3 2012-01-19 <tibble [820 x 5]>
... ...
I would like to use something analogous to map() to calculate the mean and standard deviation.
I can currently use the following separately, but is it possible to calculate both at the same time?
mutate(stats = map(data, ~ sd(.$metric)))
mutate(stats = map(data, ~ mean(.$metric)))
Another alternative is to write a function like summary, which returns quartiles and the mean, but have it calculate the mean and sd instead. Then I could use that new function in map as follows:
mutate(stats = map(data, ~ new_function(.$metric)))
Is there a better alternative?
A simple option to add multiple columns is to just make another list column of the desired summary statistics and unnest it:
library(tidyverse)
set.seed(47)
df <- tibble(date = seq(as.Date('1970-01-01'), by = 1, length.out = 4),
data = map(date, ~tibble(metric = rnorm(10))))
df
#> # A tibble: 4 x 2
#> date data
#> <date> <list>
#> 1 1970-01-01 <tibble [10 × 1]>
#> 2 1970-01-02 <tibble [10 × 1]>
#> 3 1970-01-03 <tibble [10 × 1]>
#> 4 1970-01-04 <tibble [10 × 1]>
df %>%
mutate(stats = map(data, ~data.frame(mean = mean(.x$metric),
sd = sd(.x$metric)))) %>%
unnest(stats)
#> # A tibble: 4 x 4
#> date data mean sd
#> <date> <list> <dbl> <dbl>
#> 1 1970-01-01 <tibble [10 × 1]> -0.106 0.992
#> 2 1970-01-02 <tibble [10 × 1]> -0.102 0.875
#> 3 1970-01-03 <tibble [10 × 1]> -0.833 0.979
#> 4 1970-01-04 <tibble [10 × 1]> 0.184 0.671
A more programmatic approach (which may scale better) is to iterate within the anonymous function over a list of functions. lst will automatically name them, so the results will be named, and map_dfc will cbind them into a data frame:
df %>%
mutate(stats = map(data,
~map_dfc(lst(mean, sd),
function(.fun) .fun(.x$metric)))) %>%
unnest(stats)
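To see what the inner step does for a single data frame, here is a base-R sketch of the same "iterate over a named list of functions" idea. The vector `x` and the name `stat_funs` are made up for illustration; `lapply()` plus `as.data.frame()` stand in for `map_dfc()`.

```r
# Named list of summary functions; the names become column names.
stat_funs <- list(mean = mean, sd = sd)

# Illustrative input, standing in for one nested data frame's metric column.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Apply every function to the same vector and bind the results column-wise.
stats <- as.data.frame(lapply(stat_funs, function(f) f(x)))
stats
```

Adding another statistic then only requires adding one entry to `stat_funs`, which is the sense in which this approach scales better than writing out each column by hand.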
purrr has a purpose-built function for iterating over functions/parameters like this: invoke_map. Note that the invoke_*() family is deprecated as of purrr 1.0.0, so treat this variant as legacy. If you want the function or parameters to be recycled, they have to be in a length-1 list. Since parameters should already be collected in a list, here it has to be a nested list.
df %>%
mutate(stats = map(data,
~invoke_map_dfc(lst(mean, sd),
list(list(.x$metric))))) %>%
unnest(stats)
All approaches return the same thing.

How to add calculated columns to nested data frames (list columns) using purrr

I would like to perform calculations on a nested data frame (stored as a list-column), and add the calculated variable back to each dataframe using purrr functions. I'll use this result to join to other data, and keeping it compact helps me to organize and examine it better. I can do this in a couple of steps, but it seems like there may be a solution I haven't come across. If there is a solution out there, I haven't been able to find it easily.
Load libraries. The example requires the following packages (available on CRAN):
library(dplyr)
library(tidyr) # for nest() and unnest()
library(purrr)
library(RcppRoll) # to calculate rolling mean
Example data with 3 subjects, and repeated measurements over time:
test <- tibble(
id= rep(1:3, each=20),
time = rep(1:20, 3),
var1 = rnorm(60, mean=10, sd=3),
var2 = rnorm(60, mean=95, sd=5)
)
Store the data as nested dataframe:
t_nest <- test %>% nest(-id)
id data
<int> <list>
1 1 <tibble [20 x 3]>
2 2 <tibble [20 x 3]>
3 3 <tibble [20 x 3]>
Perform calculations. I will calculate multiple new variables based on the data, although a solution for just one could be expanded later. The result of each calculation will be a numeric vector, same length as the input (n=20):
t1 <- t_nest %>%
mutate(var1_rollmean4 = map(data, ~RcppRoll::roll_mean(.$var1, n=4, align="right", fill=NA)),
var2_delta4 = map(data, ~(.$var2 - lag(.$var2, 3))*0.095),
var3 = map2(var1_rollmean4, var2_delta4, ~.x -.y))
id data var1_rollmean4 var2_delta4 var3
<int> <list> <list> <list> <list>
1 1 <tibble [20 x 3]> <dbl [20]> <dbl [20]> <dbl [20]>
2 2 <tibble [20 x 3]> <dbl [20]> <dbl [20]> <dbl [20]>
3 3 <tibble [20 x 3]> <dbl [20]> <dbl [20]> <dbl [20]>
My solution is to unnest this data and then nest it again. There doesn't seem to be anything wrong with this, but it seems like a better solution may exist.
t1 %>% unnest %>%
nest(-id)
id data
<int> <list>
1 1 <tibble [20 x 6]>
2 2 <tibble [20 x 6]>
3 3 <tibble [20 x 6]>
This other solution (from SO 42028710) is close, but not quite because it is a list rather than nested dataframes:
map_df(t_nest$data, ~ mutate(.x, var1calc = .$var1*100))
I've found quite a bit of helpful information using the purrr Cheatsheet but can't quite find the answer.
You can wrap another mutate when mapping through the data column and add the columns in each nested tibble:
t11 <- t_nest %>%
mutate(data = map(data,
~ mutate(.x,
var1_rollmean4 = RcppRoll::roll_mean(var1, n=4, align="right", fill=NA),
var2_delta4 = (var2 - lag(var2, 3))*0.095,
var3 = var1_rollmean4 - var2_delta4
)
))
t11
# A tibble: 3 x 2
# id data
# <int> <list>
#1 1 <tibble [20 x 6]>
#2 2 <tibble [20 x 6]>
#3 3 <tibble [20 x 6]>
For comparison, the unnest-nest method, followed by reordering the columns inside each nested tibble, gives an identical result:
nest_unnest <- t1 %>%
unnest %>% nest(-id) %>%
mutate(data = map(data, ~ select(.x, time, var1, var2, var1_rollmean4, var2_delta4, var3)))
identical(nest_unnest, t11)
# [1] TRUE
It seems like nesting is not necessary for what you're trying to do:
library(tidyverse)
library(zoo)
test %>%
group_by(id) %>%
mutate(var1_rollmean4 = rollapplyr(var1, 4, mean, fill=NA),
var2_delta4 = (var2 - lag(var2, 3))*0.095,
var3 = (var1_rollmean4 - var2_delta4))
# A tibble: 60 x 7
# Groups: id [3]
# id time var1 var2 var1_rollmean4 var2_delta4 var3
# <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 1 9.865199 96.45723 NA NA NA
# 2 1 2 9.951429 92.78354 NA NA NA
# 3 1 3 12.831509 95.00553 NA NA NA
# 4 1 4 12.463664 95.37171 11.277950 -0.10312483 11.381075
# 5 1 5 11.781704 92.05240 11.757076 -0.06945881 11.826535
# 6 1 6 12.756932 92.15666 12.458452 -0.27064269 12.729095
# 7 1 7 12.346409 94.32411 12.337177 -0.09952197 12.436699
# 8 1 8 10.223695 100.89043 11.777185 0.83961377 10.937571
# 9 1 9 4.031945 87.38217 9.839745 -0.45357658 10.293322
# 10 1 10 11.859477 97.96973 9.615382 0.34633428 9.269047
# ... with 50 more rows
Edit: you could still nest the result afterwards with %>% nest(-id).
If you still prefer to nest, or are nesting for other reasons, it would go like this:
t1 <- t_nest %>%
mutate(data = map(data, ~.x %>% mutate(...)))
That is, you mutate on .x within the map statement. This will treat data as a data.frame and mutate will column-bind results to it.

Resources