Creating multiple training subsets using sample() in R

I have a training dataset that consists of 60,000 observations that I want to create 9 subset training sets from. I want to sample randomly without replacement; I need 3 datasets of 500 observations, 3 datasets of 1,000 observations, and 3 datasets of 2,000 observations.
How can I do this using sample() in R?

Given your data.frame is named df you do:
sample_sizes <- c(rep(500,3), rep(1000,3), rep(2000,3))
sampling <- sample(60000, sum(sample_sizes))
training_sets <- split(df[sampling,], rep(1:9, sample_sizes))
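This samples without replacement across the whole dataset, so no observation appears in more than one training set.
If you want sampling without replacement within each training set (while allowing the same observation to appear in different training sets):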
sample_sizes <- c(rep(500,3), rep(1000,3), rep(2000,3))
sampling <- do.call(c, lapply(sample_sizes, function(i) sample(60000, i)))
training_sets <- split(df[sampling,], rep(1:9, sample_sizes))
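As a quick sanity check on the first approach, here is a minimal sketch that confirms the set sizes and that no row is shared between sets. The stand-in df with an id column and the names set_sizes and n_shared are assumptions for illustration, not part of the original question.

```r
set.seed(123)

# Stand-in data: 60,000 rows with a unique id (hypothetical example data)
df <- data.frame(id = seq_len(60000), x = rnorm(60000))

sample_sizes <- c(rep(500, 3), rep(1000, 3), rep(2000, 3))

# One draw without replacement, then cut it into the nine sets
sampling <- sample(nrow(df), sum(sample_sizes))
training_sets <- split(df[sampling, ], rep(seq_along(sample_sizes), sample_sizes))

# Number of rows in each training set
set_sizes <- vapply(training_sets, nrow, integer(1))

# Number of ids that occur more than once across all sets
all_ids <- unlist(lapply(training_sets, `[[`, "id"))
n_shared <- sum(duplicated(all_ids))
```

With the second approach (a separate sample() call per set), the same check on n_shared would usually report some overlap between sets.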

I'm not positive if you want the output to look like the screenshot, but if so, here you go:
library(tidyverse)
df <- tibble(rand = runif(6e4))
tibble(`Sample Size` = rep(c(500,1000,2000), each = 3)) |>
mutate(name = rep(paste(c("First", "Second", "Third"), "Random Sample"), 3),
samp = map2(`Sample Size`, row_number(),
\(x,y) {set.seed(y); df[sample(1:nrow(df), size = x),]})) |>
pivot_wider(names_from = name, values_from = samp)
#> # A tibble: 3 x 4
#> `Sample Size` `First Random Sample` `Second Random Sample` Third Random Samp~1
#> <dbl> <list> <list> <list>
#> 1 500 <tibble [500 x 1]> <tibble [500 x 1]> <tibble [500 x 1]>
#> 2 1000 <tibble [1,000 x 1]> <tibble [1,000 x 1]> <tibble>
#> 3 2000 <tibble [2,000 x 1]> <tibble [2,000 x 1]> <tibble>
#> # ... with abbreviated variable name 1: `Third Random Sample`

Related

Combine nesting and rolling_origin from Tidymodels in R

I am trying to train a random forest using rolling_origin from the Tidymodels suite. I would like the folds to be exactly the months of the year. Nesting looks like it could do the trick, but tune_grid is not able to find the variables when the data is nested. How can I make this work? I put a reproducible example below.
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(tidymodels))
suppressPackageStartupMessages(library(yardstick))
# Create dummy data ====================================================================================================
dates <- seq(from = as.Date("2019-01-01"), to = as.Date("2019-12-31"), by = 'day' )
l <- length(dates)
set.seed(1)
data_set <- data.frame(
date = dates,
v1 = rnorm(l),
v2 = rnorm(l),
v3 = rnorm(l),
y = rnorm(l)
)
# Random Forest Model =================================================================================================
model <-
parsnip::rand_forest(
mode = "regression",
trees = tune()) %>%
set_engine("ranger")
# grid specification
params <-
dials::parameters(
trees()
)
# Set up grid and model workflow =======================================================================================
grid <-
dials::grid_max_entropy(
params,
size = 2
)
form <- as.formula(paste("y ~ v1 + v2 + v3"))
model_workflow <-
workflows::workflow() %>%
add_model(model) %>%
add_formula(form)
# Tuning on the normal data set works ====================================================================================================
data_ro_day <- data_set %>%
rolling_origin(
initial = 304,
assess = 30,
cumulative = TRUE,
skip = 30
)
results <- tune_grid(
model_workflow,
grid = grid,
resamples = data_ro_day,
param_info = params,
metrics = metric_set(mae, mape, rmse, rsq),
control = control_grid(verbose = TRUE))
results %>% show_best("mape", n = 2)
# Tuning on the nested data set doesn't work =========================================================================================
data_ro_month <- data_set %>%
mutate(year_month = format(date, "%Y-%m")) %>%
nest(-year_month) %>%
rolling_origin(
initial = 10,
assess = 1,
cumulative = TRUE
)
results <- tune_grid(
model_workflow,
grid = grid,
resamples = data_ro_month,
param_info = params,
metrics = metric_set(mae, mape, rmse, rsq),
control = control_grid(verbose = TRUE))
results$.notes
I'm not entirely clear on how you want to divide up your data for tuning, but I would recommend looking into some of the other rsample functions like sliding_window() and especially sliding_period(). They let you create experimental designs for tuning where you can fit on certain months of data and then assess on another month, sliding along all the months you have available:
library(tidymodels)
dates <- seq(from = as.Date("2019-01-01"), to = as.Date("2019-12-31"), by = 'day' )
l <- length(dates)
set.seed(1)
data_set <- tibble(
date = dates,
v1 = rnorm(l),
v2 = rnorm(l),
v3 = rnorm(l),
y = rnorm(l)
)
month_folds <- data_set %>%
sliding_period(
date,
"month",
lookback = Inf,
skip = 4
)
month_folds
#> # Sliding period resampling
#> # A tibble: 7 x 2
#> splits id
#> <list> <chr>
#> 1 <split [151/30]> Slice1
#> 2 <split [181/31]> Slice2
#> 3 <split [212/31]> Slice3
#> 4 <split [243/30]> Slice4
#> 5 <split [273/31]> Slice5
#> 6 <split [304/30]> Slice6
#> 7 <split [334/31]> Slice7
I used skip = 4 here to only keep slices where you will have more data for training. Each of these slices will train on several months of data and assess on a new, last month. The resamples slide forward through your dataset. Since I used lookback = Inf it always includes all past data, but you can change that.
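As a rough sketch of what changing lookback might look like, the following uses lookback = 2 so that each analysis set covers only the current month plus the two before it, rather than all past data. The reduced data_set (just date and y) and the names rolling_folds, analysis_rows and assess_rows are assumptions for illustration.

```r
library(rsample)

dates <- seq(as.Date("2019-01-01"), as.Date("2019-12-31"), by = "day")
set.seed(1)
data_set <- data.frame(date = dates, y = rnorm(length(dates)))

# lookback = 2: train on a moving three-month window, assess on the next month
rolling_folds <- sliding_period(
  data_set,
  date,
  "month",
  lookback = 2
)

# Row counts of the analysis (training) and assessment sets in each slice
analysis_rows <- vapply(rolling_folds$splits, function(s) nrow(analysis(s)), integer(1))
assess_rows <- vapply(rolling_folds$splits, function(s) nrow(assessment(s)), integer(1))
```

A fixed window like this trades training-set size for recency, which may or may not suit a given problem.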
Once your resampling approach is set up in whatever way is appropriate for your domain problem, you can make a model specification and tune it:
rf_spec <-
rand_forest(
mode = "regression",
trees = tune()) %>%
set_engine("ranger")
rf_wf <-
workflow() %>%
add_model(rf_spec) %>%
add_formula(y ~ v1 + v2 + v3)
tune_grid(rf_wf, resamples = month_folds)
#> # Tuning results
#> # Sliding period resampling
#> # A tibble: 7 x 4
#> splits id .metrics .notes
#> <list> <chr> <list> <list>
#> 1 <split [151/30]> Slice1 <tibble [20 × 5]> <tibble [0 × 1]>
#> 2 <split [181/31]> Slice2 <tibble [20 × 5]> <tibble [0 × 1]>
#> 3 <split [212/31]> Slice3 <tibble [20 × 5]> <tibble [0 × 1]>
#> 4 <split [243/30]> Slice4 <tibble [20 × 5]> <tibble [0 × 1]>
#> 5 <split [273/31]> Slice5 <tibble [20 × 5]> <tibble [0 × 1]>
#> 6 <split [304/30]> Slice6 <tibble [20 × 5]> <tibble [0 × 1]>
#> 7 <split [334/31]> Slice7 <tibble [20 × 5]> <tibble [0 × 1]>
Created on 2020-11-15 by the reprex package (v0.3.0.9001)

How to map a nested dataframe, and store multiple columns as output

I have a data structure as follows:
test <- data.frame(
id= rep(1:3, each=20),
count = rnorm(60, mean=5, sd=1),
covar1 = rnorm(60, mean=10, sd=3),
covar2 = rnorm(60, mean=95, sd=5),
covar3 = rnorm(60, mean=30, sd=5)
)
Then I nest it by id:
test <- test %>% nest(-id)
I want to apply a model to each data covar column, for a given id. Then I want to store the result in a separate column. I can do this as follows:
test <- test %>% mutate(covar1_lm = map(data, ~lm(count ~ covar1, data=.x)),
covar2_lm = map(data, ~lm(count ~ covar2, data=.x)),
covar3_lm = map(data, ~lm(count ~ covar3, data=.x)))
Which gives the output I want:
> test
# A tibble: 3 x 5
id data covar1_lm covar2_lm covar3_lm
<int> <list> <list> <list> <list>
1 1 <tibble [20 × 4]> <lm> <lm> <lm>
2 2 <tibble [20 × 4]> <lm> <lm> <lm>
3 3 <tibble [20 × 4]> <lm> <lm> <lm>
The problem is my real data has a large number of covar columns, and so I'd like to reduce the boilerplate code. So I'm guessing I need some concept of dynamic variable names, but I can't figure out how to map over a dynamic set of column names?
You can pivot_longer() the dataset first, so that there is one observation (row) for each covariate for each dataset. Then you perform the model within each covariate.
test %>%
pivot_longer(starts_with("covar"),
names_to = "covariate") %>%
group_by(id, covariate) %>%
summarize(model = list(lm(count ~ value)))
You now have one observation for each combination of ID and covariate.
# A tibble: 9 x 3
# Groups: id [3]
id covariate model
<int> <chr> <list>
1 1 covar1 <lm>
2 1 covar2 <lm>
3 1 covar3 <lm>
4 2 covar1 <lm>
5 2 covar2 <lm>
6 2 covar3 <lm>
7 3 covar1 <lm>
8 3 covar2 <lm>
9 3 covar3 <lm>
If you want to turn that into the same kind of result, you could pipe this to pivot_wider(names_from = covariate, values_from = model). (But note that having one row for each model could make it easier to explore and visualize the models, especially if you tidy each with broom::tidy() and unnest them.)
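To illustrate the broom::tidy() suggestion, here is a minimal sketch that tidies each model and unnests the result into one coefficient table. The seed, the .groups = "drop" argument and the names models and coefs are assumptions added for the example.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

set.seed(1)
test <- data.frame(
  id = rep(1:3, each = 20),
  count = rnorm(60, mean = 5, sd = 1),
  covar1 = rnorm(60, mean = 10, sd = 3),
  covar2 = rnorm(60, mean = 95, sd = 5),
  covar3 = rnorm(60, mean = 30, sd = 5)
)

# One model per id/covariate combination
models <- test %>%
  pivot_longer(starts_with("covar"), names_to = "covariate") %>%
  group_by(id, covariate) %>%
  summarize(model = list(lm(count ~ value)), .groups = "drop")

# One row per model term (intercept and slope) with estimates and p-values
coefs <- models %>%
  mutate(tidied = map(model, tidy)) %>%
  select(-model) %>%
  unnest(tidied)
```

From here, filtering to term == "value" would leave one slope per id and covariate.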
An alternative to the group_by()/summarize() above would be to nest them:
test %>%
pivot_longer(starts_with("covar"),
names_to = "covariate") %>%
group_by(id, covariate) %>%
nest() %>%
mutate(model = map(data, ~ lm(count ~ value, data = .x)))

nesting categorical variable, bootstrap, then extract median in R

I'm having trouble with what seems like a simple solution. I have a data frame with some locations and each location has a value associated with it. I nested the data.frame by the locations and then bootstrapped the values using purrr (see below).
library(tidyverse)
library(modelr)
library(purrr)
locations <- c("grave","pinkham","lower pinkham", "meadow", "dodge", "young")
values <- rnorm(n = 100, mean = 3, sd = .5)
df <- data.frame(values, locations = sample(locations, 100, replace = TRUE))
df.boot <- df %>%
nest(-locations) %>%
mutate(boot = map(data,~bootstrap(.,n=100, id = "values")))
Now I'm trying to get the median from each bootstrap in the final list df.boot$boot, but can't seem to figure it out? I've tried to apply map(boot, median) but the more I dig in the more that doesn't make sense. The wanted vector in the boot list is idx from which I can get the median value and then store it (pretty much what boot function does but iterating by unique categorical variables). Any help would be much appreciated. I might just be going at this the wrong way...
If we need to extract the median of each bootstrap resample:
library(dplyr)
library(purrr)
library(modelr)
out <- df %>%
group_by(locations) %>%
nest %>%
mutate(boot = map(data, ~ bootstrap(.x, n = 100, id = 'values') %>%
pull('strap') %>%
map_dbl(~ as_tibble(.x) %>%
pull('values') %>%
median)))
out
# A tibble: 6 x 3
# Groups: locations [6]
# locations data boot
# <fct> <list> <list>
#1 pinkham <tibble [12 × 1]> <dbl [100]>
#2 lower pinkham <tibble [17 × 1]> <dbl [100]>
#3 meadow <tibble [16 × 1]> <dbl [100]>
#4 dodge <tibble [22 × 1]> <dbl [100]>
#5 grave <tibble [21 × 1]> <dbl [100]>
#6 young <tibble [12 × 1]> <dbl [100]>
Data
df <- data.frame(values, locations = sample(locations, 100, replace = TRUE))
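For comparison, here is a package-free sketch of the same idea in base R: resample each location's values with replacement and take the median of every resample. It is not the modelr approach used above; the seed and the names boot_medians and boot_summary are assumptions for illustration.

```r
set.seed(42)
locations <- c("grave", "pinkham", "lower pinkham", "meadow", "dodge", "young")
values <- rnorm(n = 100, mean = 3, sd = 0.5)
df <- data.frame(values, locations = sample(locations, 100, replace = TRUE))

# For each location: 100 bootstrap resamples, keeping the median of each one.
# Sampling indices (not values) avoids sample()'s special case for length-1 input.
boot_medians <- lapply(split(df$values, df$locations), function(v) {
  replicate(100, median(v[sample.int(length(v), replace = TRUE)]))
})

# One summary number per location: the median of its bootstrap medians
boot_summary <- vapply(boot_medians, median, numeric(1))
```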

Get Mutate Error When Applying Purrr::Map on Grouped Data

Hi, I am trying to apply a very simple function using purrr::map, but I keep getting this error:
Error in mutate_impl(.data, dots) :
Evaluation error: unused argument (.x[[i]]).
The code is below:
data = data.frame(name = c('A', 'B', 'C'), metric = c(0.29, 0.39,0.89))
get_sample_size = function(metric, threshold = 0.01){
sample_size = ceiling((1.96^2)*(metric*(1-metric))/(threshold^2))
return(data.frame(sample_size))
}
data %>% group_by(name) %>% tidyr::nest() %>%
dplyr::mutate(result = purrr::map( .x = data, .f = get_sample_size, metric = metric, threshold = 0.01 ))
You don't need nest. The metric argument of the get_sample_size function should be a numeric vector, but if you nest, the data column is a list of data frames, which cannot be the input for the metric argument.
I think you can use summarize and map to apply your function to the metric column.
library(tidyverse)
data %>%
group_by(name) %>%
summarize(result = purrr::map(.x = metric,
.f = get_sample_size,
threshold = 0.01))
# # A tibble: 3 x 2
# name result
# <fct> <list>
# 1 A <data.frame [1 x 1]>
# 2 B <data.frame [1 x 1]>
# 3 C <data.frame [1 x 1]>
When you pass metric in the ... part of map, nothing tells map that it refers to a column in the nested data frame. Once you nest the data like you've done, metric is no longer a column in the outer data frame; it's a column in each nested frame, and that list column is also called "data." (This is a good example of why you want more specific variable names, by the way.)
If you're mapping over the data column, you can use $metric to point to that column, either in writing out a function, as I've done here (such as df$metric), or in formula notation (such as .$metric).
As @www said, you don't need nested data frames in this case. But for a more complicated case, you might need nested data frames to work with, such as for building models, so it's good to know how to reference exactly the data you want.
library(tidyverse)
data %>%
group_by(name) %>%
tidyr::nest() %>%
mutate(result = map(data, function(df) {
get_sample_size(metric = df$metric, threshold = 0.01)
}))
#> # A tibble: 3 x 3
#> name data result
#> <fct> <list> <list>
#> 1 A <tibble [1 × 1]> <data.frame [1 × 1]>
#> 2 B <tibble [1 × 1]> <data.frame [1 × 1]>
#> 3 C <tibble [1 × 1]> <data.frame [1 × 1]>
Created on 2019-01-16 by the reprex package (v0.2.1)
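The formula-notation variant mentioned above might look like the following sketch, where .x stands for each nested data frame. The name nested is an assumption for illustration; the function and data are those from the question.

```r
library(dplyr)
library(tidyr)
library(purrr)

data <- data.frame(name = c("A", "B", "C"), metric = c(0.29, 0.39, 0.89))

get_sample_size <- function(metric, threshold = 0.01) {
  sample_size <- ceiling((1.96^2) * (metric * (1 - metric)) / (threshold^2))
  data.frame(sample_size)
}

# .x is one nested data frame; .x$metric pulls out its metric column
nested <- data %>%
  group_by(name) %>%
  nest() %>%
  mutate(result = map(data, ~ get_sample_size(metric = .x$metric, threshold = 0.01)))
```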

Use apply() to iterate linear regression models through multiple dependent variables

I need to fit the same linear regression separately for each of 45 different id values. How can I use tidy (dplyr, apply, etc.) code to accomplish this?
I have a dataset with three variables data = c(id, distance, actPct) such that id == 1:45; -10 <= distance <= 10; 0 <= actPct <= 1.
I need to run a regression, model0n, on each value of id, with the output of each model0n stored in a new tibble/df. I have completed it for a single regression:
model01 <- data %>%
filter(id == 1) %>%
filter(distance < 1) %>%
filter(distance > -4)
model01 <- lm(data = model01, actPct~distance)
Example Data
set.seed(42)
id <- sample(1:45, 100, replace = TRUE)
distance <- sample(-4:4, 100, replace = TRUE)
actPct <- runif(100, min = 0, max = 1)
# Build the tibble directly so the columns are named id, distance, actPct
data01 <- tibble(id = id, distance = distance, actPct = actPct)
I expect a new tibble or dataframe that has model01:model45 so I can put all of the regression outputs into a single table.
You can use group_by, nest and mutate with map from the tidyverse to accomplish this:
data01 %>%
group_by(id) %>%
nest() %>%
mutate(models = map(data, ~ lm(actPct ~ distance, data = .x)))
# A tibble: 41 x 3
# id data models
# <int> <list> <list>
# 1 42 <tibble [3 x 2]> <S3: lm>
# 2 43 <tibble [4 x 2]> <S3: lm>
# 3 13 <tibble [2 x 2]> <S3: lm>
# 4 38 <tibble [4 x 2]> <S3: lm>
# 5 29 <tibble [2 x 2]> <S3: lm>
# 6 24 <tibble [5 x 2]> <S3: lm>
# 7 34 <tibble [5 x 2]> <S3: lm>
# 8 7 <tibble [3 x 2]> <S3: lm>
# 9 30 <tibble [2 x 2]> <S3: lm>
# 10 32 <tibble [2 x 2]> <S3: lm>
# ... with 31 more rows
See also the chapter in R for Data Science about many models: https://r4ds.had.co.nz/many-models.html
Data
set.seed(42)
id <- sample(1:45, 100, replace = T)
distance <- sample(-4:4, 100, replace = T)
actPct <- runif(100, min = 0, max = 1)
data01 <- tibble(id = id, distance = distance, actPct = actPct)
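To get from the list of models to the single table the question asks for, one option is sketched below in base R rather than with the tidyverse pipeline above. The names fits and coef_table are assumptions for illustration; ids with too few observations simply get an NA slope.

```r
set.seed(42)
data01 <- data.frame(
  id = sample(1:45, 100, replace = TRUE),
  distance = sample(-4:4, 100, replace = TRUE),
  actPct = runif(100, min = 0, max = 1)
)

# One lm per id value that actually occurs in the data
fits <- lapply(split(data01, data01$id), function(d) lm(actPct ~ distance, data = d))

# Stack the coefficients of every model into one long table
coef_table <- do.call(rbind, lapply(names(fits), function(i) {
  cf <- coef(fits[[i]])
  data.frame(id = as.integer(i), term = names(cf), estimate = unname(cf))
}))
```

Each id contributes two rows, one for the intercept and one for the distance slope.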
