I'm working with the ChickWeight data set in R. I'm looking to create multiple models, each trained for an individual chick. As such, I am nesting the data so that a data frame is created for each individual chick and stored in a list column.
Here is the start:
library(tidyverse)
library(datasets)
data("ChickWeight")
ChickWeightNest <- ChickWeight %>%
  group_by(Chick) %>%
  nest()
From here, training a linear regression model on all of the nested data frames simultaneously is easy: build the model as a function, then mutate a new column and map over the list column. However, building a more sophisticated model (e.g. xgboost) requires first splitting the data into training and testing sets. How can I split all of my nested data frames at once to create training and testing sets, so that I can train multiple models simultaneously?
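For illustration, that lm step looks roughly like this (just a sketch; chick_lm and the formula weight ~ Time are only examples):
# Sketch of the per-chick lm mapping described above (illustrative names)
chick_lm <- function(df) lm(weight ~ Time, data = df)

ChickWeightNest %>%
  mutate(model = purrr::map(data, chick_lm))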
As a side note, information on training/tuning multiple models seems to be relatively sparse in my research; any related resources or past Stack Overflow questions would be much appreciated.
Maybe you want something like this, where you first randomly sample "train" or "test" for each row in a new column (to use later) and then group again to nest the data per chick:
library(dplyr)
library(tidyr)
library(datasets)
data("ChickWeight")
ChickWeight %>%
  group_by(Chick) %>%
  rowwise() %>%
  mutate(split = sample(c("train", "test"), n(), replace = FALSE)) %>%
  group_by(Chick) %>%
  nest()
#> # A tibble: 50 × 2
#> # Groups: Chick [50]
#> Chick data
#> <ord> <list>
#> 1 1 <tibble [12 × 4]>
#> 2 2 <tibble [12 × 4]>
#> 3 3 <tibble [12 × 4]>
#> 4 4 <tibble [12 × 4]>
#> 5 5 <tibble [12 × 4]>
#> 6 6 <tibble [12 × 4]>
#> 7 7 <tibble [12 × 4]>
#> 8 8 <tibble [11 × 4]>
#> 9 9 <tibble [12 × 4]>
#> 10 10 <tibble [12 × 4]>
#> # … with 40 more rows
Created on 2022-06-29 by the reprex package (v2.0.1)
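If you go this route, a possible follow-up (just a sketch, swapping in a 75/25 draw via the prob argument rather than the 50/50 split above) is to filter each nested tibble by the label afterwards:
# Sketch: label each row, nest, then filter the nested tibbles by the label
ChickWeight %>%
  group_by(Chick) %>%
  mutate(split = sample(c("train", "test"), n(), replace = TRUE, prob = c(0.75, 0.25))) %>%
  nest() %>%
  mutate(train = purrr::map(data, ~ dplyr::filter(.x, split == "train")),
         test  = purrr::map(data, ~ dplyr::filter(.x, split == "test")))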
The key here is realizing that the nested data column is a list (each element is a data frame), so you have to use list functions on it, for example lapply() from base R or map() from purrr.
Here's an example of how that would work, using the rsample package to do the split (75% for training):
ChickWeightNest_example <- ChickWeightNest %>%
  mutate(data_split = purrr::map(data,
                                 ~ rsample::initial_split(.x, prop = 0.75))) %>%
  mutate(data_training_only = purrr::map(data_split,
                                         ~ rsample::training(.x)),
         data_testing_only = purrr::map(data_split,
                                        ~ rsample::testing(.x)))
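From there, fitting e.g. an xgboost model per chick is just another map over the training sets. Here is a minimal sketch using parsnip's xgboost engine (assuming the xgboost package is available; the formula weight ~ Time is only an illustration, and some chicks have very few observations, so the per-chick fits will be rough):
library(parsnip)

# Sketch: fit one xgboost model per chick on its training split,
# then predict on the matching test split (column names are illustrative)
ChickWeightNest_models <- ChickWeightNest_example %>%
  mutate(model = purrr::map(data_training_only,
                            ~ boost_tree(mode = "regression") %>%
                                set_engine("xgboost") %>%
                                fit(weight ~ Time, data = .x)),
         test_preds = purrr::map2(model, data_testing_only,
                                  ~ predict(.x, new_data = .y)))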
Related
I have a dataset with multiple outcome columns that I would like to predict with the same preprocessing steps and models. Is there a way to run the same recipe and models (with tuning; I'm using workflow_map()) on multiple outcome variables, with a separate model for each outcome?
Essentially, I want to loop through the same preprocessing steps and models for each outcome and avoid having to do this:
model_recipe1 <- recipe(outcome_1 ~ ., data) %>%
  step_1

model_recipe2 <- recipe(outcome_2 ~ ., data) %>%
  step_1

model_recipe3 <- recipe(outcome_3 ~ ., data) %>%
  step_1
and would instead like to do something like this:
model_recipe <- recipe(outcome[i] ~ ., data) %>%
  step_1
Try running this once before the rest of your code:
set.seed(123)
If that doesn't solve it, try running this once at the start of your script:
addTaskCallback(function(...) {set.seed(123);TRUE})
Both of these methods try to ensure any random processes provide the same outcomes each time you run your script, allowing reproducibility.
I'm not sure if we 100% recommend the approach you are trying, but it will work in some circumstances:
library(tidymodels)
folds <- bootstraps(mtcars, times = 5)
wf_set <- workflow_set(list(mpg ~ ., wt ~ ., disp ~ .), list(linear_reg()))
workflow_map(wf_set, "fit_resamples", resamples = folds)
#> # A workflow set/tibble: 3 × 4
#> wflow_id info option result
#> <chr> <list> <list> <list>
#> 1 formula_1_linear_reg <tibble [1 × 4]> <opts[1]> <rsmp[+]>
#> 2 formula_2_linear_reg <tibble [1 × 4]> <opts[1]> <rsmp[+]>
#> 3 formula_3_linear_reg <tibble [1 × 4]> <opts[1]> <rsmp[+]>
Created on 2022-08-04 by the reprex package (v2.0.1)
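Once the set is fitted you can pull out the resampling results with the usual workflowsets helpers, for example (a sketch; fitted_set is just an illustrative name):
fitted_set <- workflow_map(wf_set, "fit_resamples", resamples = folds)  # illustrative name

collect_metrics(fitted_set)
rank_results(fitted_set, rank_metric = "rmse")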
To make many recipes in an iterative fashion, you'll need a bit of metaprogramming such as with rlang. You can write a function to take (in this case) a string and create a recipe:
library(rlang)
my_recipe <- function(outcome) {
  form <- new_formula(ensym(outcome), expr(.))
  recipe(form, data = mtcars) %>%
    step_normalize(all_numeric_predictors())
}
And then you can use this function with purrr::map() across your outcomes:
library(tidymodels)
library(rlang)
folds <- bootstraps(mtcars, times = 5)
wf_set <- workflow_set(
  map(c("mpg", "wt", "disp"), my_recipe),
  list(linear_reg())
)
workflow_map(wf_set, "fit_resamples", resamples = folds)
#> # A workflow set/tibble: 3 × 4
#> wflow_id info option result
#> <chr> <list> <list> <list>
#> 1 recipe_1_linear_reg <tibble [1 × 4]> <opts[1]> <rsmp[+]>
#> 2 recipe_2_linear_reg <tibble [1 × 4]> <opts[1]> <rsmp[+]>
#> 3 recipe_3_linear_reg <tibble [1 × 4]> <opts[1]> <rsmp[+]>
Created on 2022-08-04 by the reprex package (v2.0.1)
Problem
I have a list that contains sets of nested lists. I need to test whether all of the data frames at the lowest level are equal, and I need to respect the grouping of the data while I do this test.
I am trying to solve the problem using purrr::map() but I am having real trouble understanding how I can iterate over each sub-list.
I have used gapminder in this example only because it can be nested twice, which is the same as my actual data (which I can't share here).
The data
library(dplyr)
library(tidyr)
library(gapminder)
library(purrr)
tf <- gapminder %>%
  select(continent, country, year) %>%
  group_by(continent, year) %>%
  nest() %>%
  arrange(desc(year)) %>%
  ungroup() %>%
  group_by(year) %>%
  nest()
My attempt
tf$data[[1]] contains a list of data on each continent. It is these lists that I need to check for equality. This dataset happens to produce unequal lists at this level, but that doesn't matter; I just need the pattern for my actual data.
My attempt only allows me to iterate through one list at the bottom level.
map_chr(tf$data[[1]]$data, all_equal, current = tf$data[[1]]$data[[1]])
I need to do this over all of the lists at the bottom level: for each year in tf, for each list in tf$data, for each continent in tf$data[[1]], for each list in tf$data[[1]]$data, compare whether the first list tf$data[[1]]$data[[1]] is equal to the other lists at that level.
Why not unnest the list one level? Then you can use all dplyr has to offer, like group-wise mutate:
tf %>%
  unnest(data) %>%
  mutate(equal_to_first = map_chr(data, all_equal, current = data[[1]]))
Result:
# A tibble: 60 x 4
# Groups: year [12]
year continent data equal_to_first
<int> <fct> <list> <chr>
1 2007 Asia <tibble [33 × 1]> TRUE
2 2007 Europe <tibble [30 × 1]> Different number of rows
3 2007 Africa <tibble [52 × 1]> Different number of rows
4 2007 Americas <tibble [25 × 1]> Different number of rows
5 2007 Oceania <tibble [2 × 1]> Different number of rows
6 2002 Asia <tibble [33 × 1]> TRUE
7 2002 Europe <tibble [30 × 1]> Different number of rows
8 2002 Africa <tibble [52 × 1]> Different number of rows
9 2002 Americas <tibble [25 × 1]> Different number of rows
10 2002 Oceania <tibble [2 × 1]> Different number of rows
# … with 50 more rows
If you would like to get your original structure back, you can simply nest the result again.
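For example, a sketch of that re-nesting (to get the outer year-level structure back):
# Sketch: unnest one level, add the comparison column, then nest by year again
tf %>%
  unnest(data) %>%
  mutate(equal_to_first = map_chr(data, all_equal, current = data[[1]])) %>%
  group_by(year) %>%
  nest()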
I'm having trouble with what seems like a simple solution. I have a data frame with some locations and each location has a value associated with it. I nested the data.frame by the locations and then bootstrapped the values using purrr (see below).
library(tidyverse)
library(modelr)
library(purrr)

locations <- c("grave", "pinkham", "lower pinkham", "meadow", "dodge", "young")
values <- rnorm(n = 100, mean = 3, sd = .5)
df <- data.frame(values, locations = sample(locations, 100, replace = TRUE))

df.boot <- df %>%
  nest(-locations) %>%
  mutate(boot = map(data, ~ bootstrap(., n = 100, id = "values")))
Now I'm trying to get the median from each bootstrap in the final list df.boot$boot, but I can't seem to figure it out. I've tried applying map(boot, median), but the more I dig in, the less sense that makes. The wanted vector in the boot list is idx, from which I can get the median value and then store it (pretty much what the boot function does, but iterating over unique categorical variables). Any help would be much appreciated. I might just be going about this the wrong way...
If we need to extract the median
library(dplyr)
library(purrr)
library(modelr)
out <- df %>%
  group_by(locations) %>%
  nest %>%
  mutate(boot = map(data, ~ bootstrap(.x, n = 100, id = 'values') %>%
                            pull('strap') %>%
                            map_dbl(~ as_tibble(.x) %>%
                                      pull('values') %>%
                                      median)))
out
# A tibble: 6 x 3
# Groups: locations [6]
# locations data boot
# <fct> <list> <list>
#1 pinkham <tibble [12 × 1]> <dbl [100]>
#2 lower pinkham <tibble [17 × 1]> <dbl [100]>
#3 meadow <tibble [16 × 1]> <dbl [100]>
#4 dodge <tibble [22 × 1]> <dbl [100]>
#5 grave <tibble [21 × 1]> <dbl [100]>
#6 young <tibble [12 × 1]> <dbl [100]>
data
df <- data.frame(values, locations = sample(locations, 100, replace = TRUE))
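And if a single number per location is wanted afterwards, for example the median of the 100 bootstrap medians and their spread, that is one more map over the list column (a sketch; the new column names are illustrative):
# Sketch: summarise the bootstrap medians stored in each element of `boot`
out %>%
  mutate(boot_median = map_dbl(boot, median),
         boot_se     = map_dbl(boot, sd)) %>%
  select(locations, boot_median, boot_se)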
I have data of the following form
date data
<chr> <list>
1 2012-01-05 <tibble [796 x 5]>
2 2012-01-12 <tibble [831 x 5]>
3 2012-01-19 <tibble [820 x 5]>
... ...
I would like to use something analogous to map() to calculate the mean and standard deviation.
I can currently use the following separately, but is it possible to calculate both at the same time?
mutate(stats = map(data, ~ sd(.$metric)))
mutate(stats = map(data, ~ mean(.$metric)))
Another alternative is to write a function that is like summary() (which returns the quartiles and the mean) but that calculates the mean and sd instead; then I could use that new function in map() as follows:
mutate(stats = map(data, ~ new_function(.$metric)))
Is there a better alternative?
A simple option to add multiple columns is to just make another list column of the desired summary statistics and unnest it:
library(tidyverse)
set.seed(47)
df <- data_frame(date = seq(as.Date('1970-01-01'), by = 1, length = 4),
                 data = map(date, ~data_frame(metric = rnorm(10))))
df
#> # A tibble: 4 x 2
#> date data
#> <date> <list>
#> 1 1970-01-01 <tibble [10 × 1]>
#> 2 1970-01-02 <tibble [10 × 1]>
#> 3 1970-01-03 <tibble [10 × 1]>
#> 4 1970-01-04 <tibble [10 × 1]>
df %>%
  mutate(stats = map(data, ~data.frame(mean = mean(.x$metric),
                                       sd = sd(.x$metric)))) %>%
  unnest(stats)
#> # A tibble: 4 x 4
#> date data mean sd
#> <date> <list> <dbl> <dbl>
#> 1 1970-01-01 <tibble [10 × 1]> -0.106 0.992
#> 2 1970-01-02 <tibble [10 × 1]> -0.102 0.875
#> 3 1970-01-03 <tibble [10 × 1]> -0.833 0.979
#> 4 1970-01-04 <tibble [10 × 1]> 0.184 0.671
A more programmatic approach (which may scale better) is to iterate within the anonymous function over a list of functions. lst will automatically name them, so the results will be named, and map_dfc will cbind them into a data frame:
df %>%
  mutate(stats = map(data,
                     ~map_dfc(lst(mean, sd),
                              function(.fun) .fun(.x$metric)))) %>%
  unnest(stats)
purrr has a purpose-built function for iterating over functions/parameters like this: invoke_map. If you want the function or parameters to be recycled, they have to be in a length-1 list. Since parameters should already be collected in a list, here it has to be a nested list.
df %>%
  mutate(stats = map(data,
                     ~invoke_map_dfc(lst(mean, sd),
                                     list(list(.x$metric))))) %>%
  unnest(stats)
All approaches return the same thing.
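With tidyr 1.0+, another option along the same lines (a sketch) is to return a named vector of statistics and spread it with unnest_wider():
# Sketch: one named vector of summary statistics per nested tibble
df %>%
  mutate(stats = map(data, ~ c(mean = mean(.x$metric), sd = sd(.x$metric)))) %>%
  unnest_wider(stats)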
It will be difficult to come up with a reproducible example for this, as there is no open-source data yet and I'm not sure I'm allowed to share the data I have. I will try my best to explain it, and if that doesn't work, I can take some time to simulate some data at a later point. Hopefully it's an easy solution though...
Background
I am busy creating an R package for kinetic modelling in the field that I work in (https://github.com/mathesong/kinfitr). I am trying, as best I can, to make everything amenable to tidyverse tooling. However, there is one particular use case that I can't figure out, as it involves pulling data of rather different structures from several different formats and combining them in the model.
In the README on the page, I present a solution for Reference Region models, where all inputs are of the same length and I can work with the following workflow:
data %>%
  gather() %>%
  group_by() %>%
  do()
The Issue
However, for arterial models, the input arguments are as follows:
Brain kinetic data: times, values, weights - each vectors of the same length, in this case 38
Blood kinetic data: bloodinput - data frame of 4096 rows x 4 columns. For the sake of convenience, all models read this in as a data frame with all the information already interpolated.
Each of the models requires inputs of all three vectors, as well as the bloodinput data frame.
I currently have all the data stored in a list, with an element for each measurement. Each element of the list contains 1. a data frame with the brain kinetic data (each region of the brain, let's say 3 regions), as well as times and weights, and 2. a data frame containing the bloodinput data. Thus I create my final data frame
datdf <- map(dat, 'braindf') %>%    # Extract the brain data
  bind_rows(.id = "id") %>%         # Add an id column
  select(PET = id, Times = Times, Weights = weights,
         R1 = Region1, R2 = Region2, R3 = Region3) %>%  # Rename and select columns
  group_by(PET) %>%                 # Group by each measurement
  nest() %>%                        # Nest everything
  rename(braindata = data) %>%      # Rename
  mutate(Subjname = stringr::str_extract(....),            # Add subject acronym
         PETNo = as.numeric(stringr::str_extract(....)),   # Add measurement number
         bloodinput = map(dat, 'bloodinput'))              # Add blood input data frame as a nested column
This leaves me with the following
# A tibble: 6 × 5
PET braindata Subjname PETNo bloodinput
<chr> <list> <chr> <dbl> <list>
1 s1_1 <tibble [38 × 6]> s1 1 <data.frame [4,096 × 4]>
2 s1_2 <tibble [38 × 6]> s1 2 <data.frame [4,096 × 4]>
3 s2_1 <tibble [38 × 6]> s2 1 <data.frame [4,096 × 4]>
4 s2_2 <tibble [38 × 6]> s2 2 <data.frame [4,096 × 4]>
5 s1_1 <tibble [38 × 6]> s3 1 <data.frame [4,096 × 4]>
6 s2_2 <tibble [38 × 6]> s3 2 <data.frame [4,096 × 4]>
where each braindata tibble contains the following:
head(datdf[1,]$braindata[[1]])
# A tibble: 6 × 6
Times Weights R1 R2 R3
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0 0 0.00E+00 0.00E+00 0.00E+00
2 22 0.3 1.12E-03 4.14E-03 4.78E-04
3 32 0.5 5.61E-01 4.08E-01 7.38E-01
4 42 0.7 4.53E+01 4.50E+01 5.61E+01
5 52 0.7 8.12E+01 8.07E+01 1.02E+02
6 62 0.9 1.03E+02 1.04E+02 1.31E+02
From this point, I cannot figure out how to fit the model for each row.
This is what I have tried:
R1_outcomes <- datdf %>%
  group_by(PET) %>%   # or rowwise()
  mutate(onetcmout = onetcm(t_tac = .$braindata[[1]]$Times / 60,
                            tac = .$braindata[[1]]$R1,
                            input = .$bloodinput,
                            weights = .$braindata[[1]]$Weights))

R1_outcomes <- datdf %>%
  rowwise() %>%
  do(onetcmout = onetcm(t_tac = .$braindata[[1]]$Times / 60,
                        tac = .$braindata[[1]]$R1,
                        input = .$bloodinput,
                        weights = .$braindata[[1]]$Weights))
I'm sure there's a way of doing this with the map functions, but I can't quite figure out how.
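My best guess is that it involves mapping over the two list columns in parallel, perhaps something along these lines (an untested sketch, assuming onetcm() takes the arguments shown above):
# Untested sketch: fit onetcm() row-wise by mapping over the two list columns
R1_outcomes <- datdf %>%
  mutate(onetcmout = purrr::map2(braindata, bloodinput,
                                 ~ onetcm(t_tac = .x$Times / 60,
                                          tac = .x$R1,
                                          input = .y,
                                          weights = .x$Weights)))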
I would really appreciate any advice on how I might be able to do this. Thank you to anyone in advance!