I am trying to calculate AUC, precision, recall, and accuracy for every group in my data frame (I have a single data frame that holds the predictions from three different models, concatenated).
What is the tidyverse syntax to do it? I want to use the yardstick package by Max Kuhn to calculate these metrics.
Here is a sample data frame and what I have so far:
> library(tidyverse)
> library(yardstick)
>
> sample_df <- data_frame(
+ group_type = rep(c('a', 'b', 'c'), each = 5), # repeats each element 5 times
+ true_label = as.factor(rbinom(15, 1, 0.3)), # generates 1 with 30% prob
+ pred_prob = runif(15, 0, 1) # generates 15 decimals between 0 and 1 from uniform dist
+ ) %>%
+ mutate(pred_label = as.factor(if_else(pred_prob > 0.5, 1, 0)))
>
> sample_df
# A tibble: 15 x 4
group_type true_label pred_prob pred_label
<chr> <fct> <dbl> <fct>
1 a 1 0.327 0
2 a 1 0.286 0
3 a 0 0.0662 0
4 a 0 0.993 1
5 a 0 0.835 1
6 b 0 0.975 1
7 b 0 0.436 0
8 b 0 0.585 1
9 b 0 0.478 0
10 b 1 0.541 1
11 c 1 0.247 0
12 c 0 0.608 1
13 c 0 0.215 0
14 c 0 0.937 1
15 c 0 0.819 1
>
Metrics:
> # metrics for the full data
> precision(sample_df, truth = true_label, estimate = pred_label)
[1] 0.5714286
> recall(sample_df, truth = true_label, estimate = pred_label)
[1] 0.3636364
> accuracy(sample_df, truth = true_label, estimate = pred_label)
[1] 0.3333333
> roc_auc(sample_df, truth = true_label, pred_prob)
[1] 0.7727273
>
Now, how do I get these metrics for each group in my dataset?
sample_df %>%
group_by(group_type) %>%
summarize(???)
An example using unnest:
sample_df %>%
group_by(group_type) %>%
do(auc = roc_auc(., true_label, pred_prob),
acc = accuracy(., true_label, pred_label),
recall = recall(., true_label, pred_label),
precision = precision(., true_label, pred_label)) %>% unnest
However,
I would actually suggest not using yardstick here, because it doesn't play nicely with dplyr's summarize. For the AUC it just wraps another ROC package under the hood anyway. I would simply write your own functions that take two vectors.
The design of yardstick is awkward because it requires a data.frame as its first input; it is trying to be too clever. Under the dplyr framework that isn't necessary, because inside summarize and mutate functions already see the variables of a data.frame without an explicit data parameter.
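As a rough sketch of the "write your own functions" idea: plain functions of two vectors drop straight into summarize. The helper names (my_precision, my_recall, my_accuracy) and the toy data are made up for illustration, and the code assumes "1" is the positive class. That may not match how yardstick picks the event level, so the numbers need not agree with yardstick's.

```r
library(dplyr)

# Hypothetical helpers: plain functions of two vectors, with "1" as the positive class.
# as.character() avoids errors when the two factors carry different level sets.
my_precision <- function(truth, estimate, positive = "1") {
  truth <- as.character(truth); estimate <- as.character(estimate)
  sum(truth == positive & estimate == positive) / sum(estimate == positive)
}
my_recall <- function(truth, estimate, positive = "1") {
  truth <- as.character(truth); estimate <- as.character(estimate)
  sum(truth == positive & estimate == positive) / sum(truth == positive)
}
my_accuracy <- function(truth, estimate) {
  mean(as.character(truth) == as.character(estimate))
}

# Small made-up data set: two groups of four rows each
toy <- data.frame(
  group_type = rep(c("a", "b"), each = 4),
  true_label = factor(c(1, 0, 1, 0, 1, 1, 0, 0)),
  pred_label = factor(c(1, 1, 0, 0, 1, 1, 1, 0))
)

# One row per group, one column per metric
by_group <- toy %>%
  group_by(group_type) %>%
  summarise(
    precision = my_precision(true_label, pred_label),
    recall    = my_recall(true_label, pred_label),
    accuracy  = my_accuracy(true_label, pred_label)
  )
by_group
```

Because the helpers only see vectors, the same functions also work inside mutate or on ungrouped data.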
As others have noted, the functions in yardstick don't really play nicely with grouped data frames (at least not yet). A workaround is to work with nested data.
To reduce duplication, it's probably also a good idea to write a simple wrapper function that calculates all of the summary metrics you want in one call. Here's an example of how you could go about doing just that:
reprex::reprex_info()
#> Created by the reprex package v0.1.1.9000 on 2018-02-09
Set up first:
library(tidyverse)
library(yardstick)
set.seed(1)
# Given sample data
sample_df <- data_frame(
group_type = rep(c('a', 'b', 'c'), each = 5), # repeats each element 5 times
true_label = as.factor(rbinom(15, 1, 0.3)), # generates 1 with 30% prob
pred_prob = runif(15, 0, 1) # generates 15 decimals between 0 and 1 from uniform dist
) %>%
mutate(pred_label = as.factor(if_else(pred_prob > 0.5, 1, 0)))
#> Warning: package 'bindrcpp' was built under R version 3.3.3
Here are the wrappers:
# Wrapper to calculate several metrics from same data
performance_metrics <- function(data, truth, estimate, prob) {
metrics <- lst(precision, recall, accuracy) # these all share arguments
values <- invoke_map_df(metrics, list(list(data)), truth, estimate)
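# roc_auc takes the probability column instead of the predicted label.
# Use `data` here: the original passed sample_df, which is why the roc_auc
# column in the output printed below repeats the same overall value.
roc <- roc_auc(data, truth, prob)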
bind_cols(values, roc_auc = roc)
}
# Wrap the wrapper with default arguments
metrics <- partial(performance_metrics,
truth = "true_label",
estimate = "pred_label",
prob = "pred_prob")
And apply to groups by nesting the data:
sample_df %>%
nest(-group_type) %>%
mutate(metrics = map(data, metrics)) %>%
unnest(metrics)
#> # A tibble: 3 x 6
#> group_type data precision recall accuracy roc_auc
#> <chr> <list> <dbl> <dbl> <dbl> <dbl>
#> 1 a <tibble [5 x 3]> 0.5000000 0.2500000 0.2 0.5909091
#> 2 b <tibble [5 x 3]> 0.6666667 0.6666667 0.6 0.5909091
#> 3 c <tibble [5 x 3]> 0.7500000 0.7500000 0.6 0.5909091
I managed to do it by splitting the data frame into a list and mapping the function over each list element:
library(tidyverse)
library(yardstick)
sample_df %>%
split(.$group_type) %>%
map_dfr(precision, true_label, pred_label)
#output
## A tibble: 1 x 3
a b c
<dbl> <dbl> <dbl>
1 0.500 0.667 1.00
It seems group_by is not supported by yardstick functions yet.
This also works:
sample_df %>%
split(.$group_type) %>%
map_dfr(function(x){
prec = precision(x, true_label, pred_label)
rec = recall(x, true_label, pred_label)
return(data.frame(prec, rec))
})
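For readers on a more recent yardstick release: the limitation noted above appears to have been lifted, and the metric functions respect group_by directly. The sketch below is a minimal illustration under that assumption; the toy data and the name class_metrics are made up, and the factor levels are ordered so that "1" comes first, since yardstick treats the first level as the event by default.

```r
library(dplyr)
library(yardstick)

# Made-up data; levels = c(1, 0) puts the positive class first
toy <- tibble(
  group_type = rep(c("a", "b"), each = 4),
  true_label = factor(c(1, 0, 1, 0, 1, 1, 0, 0), levels = c(1, 0)),
  pred_label = factor(c(1, 1, 0, 0, 1, 1, 1, 0), levels = c(1, 0))
)

# Bundle several class metrics into one callable
class_metrics <- metric_set(precision, recall, accuracy)

# Grouped input gives one row per group and metric
res <- toy %>%
  group_by(group_type) %>%
  class_metrics(truth = true_label, estimate = pred_label)
res
```

The result is in long format (columns .metric, .estimator, .estimate), so a pivot is needed if one column per metric is preferred.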
I used the example in http://r4ds.had.co.nz/many-models.html
It uses nest but also uses precision as you requested.
library(tidyverse)
library(yardstick)
sample_df <- data_frame(group_type = rep(c('a', 'b', 'c'), each = 5), # repeats each element 5 times
true_label = as.factor(rbinom(15, 1, 0.3)), # generates 1 with 30% prob
pred_prob = runif(15, 0, 1) # generates 15 decimals between 0 and 1 from uniform dist
) %>%
mutate(pred_label = as.factor(if_else(pred_prob > 0.5, 1, 0)))
by_group_type <- sample_df %>% group_by(group_type) %>% nest()
stick_m_1 <- function(df){
precision(df,truth = true_label, estimate = pred_label)
}
models <- map(by_group_type$data,stick_m_1)
models
I have been struggling for a while to build my own compact letter displays (CLDs) from the output of a TukeyHSD test.
First I've done a two-way ANOVA:
aov2_arbuscular <- aov(arbuscular_count ~ block + pesticide*fertilizer, data = main_trial)
And did a TukeyHSD test as a post hoc test:
tk_arbuscular <- TukeyHSD(aov2_arbuscular)
Because I could not generate the CLD from the TukeyHSD output, I used the emmeans() and cld() functions instead.
tk_arbuscular_model <- emmeans(aov2_arbuscular,
pairwise ~ pesticide*fertilizer,
adjust = "tukey")
tk_arbuscular_model_cld <- cld(tk_arbuscular_model$emmeans,
alpha = .05,
Letters = letters)
I thought that TukeyHSD and emmeans with adjust = "tukey" would produce the same output. They do for most results, but unfortunately there are a few exceptions.
I have already written my results section and do not want to adjust all the p-values again. Can someone therefore help me generate the CLDs from the TukeyHSD output, so that I can integrate them into a ggplot?
You did not provide your data, so I am creating my own reprex that can also be analyzed as a two-factorial block design, i.e. with a model similar to yours.
As you can see, I could not reproduce your problem - all p-values are basically identical. I noticed that you added the adjust = "Tukey" to the emmeans() statement, but I usually add it to the cld() statement instead - however that shouldn't be the problem.
library(tidyverse)
library(emmeans)
library(multcomp)
library(multcompView)
dataURL <- "https://raw.githubusercontent.com/SchmidtPaul/DSFAIR/master/data/Gomez%26Gomez1984.csv"
dat <- read_csv(dataURL) %>%
filter(G %in% c("A", "B") & N %in% c("N1", "N2")) %>%
mutate_at(vars(rep:N), as.factor)
aov <- aov(yield ~ G + N + G:N + rep, data = dat)
# get contrasts via 3 options ---------------------------------------------
option1 <- stats::TukeyHSD(aov) %>%
pluck("G:N")
option2 <- emmeans::emmeans(aov, ~ G:N) %>%
emmeans::pairs(adjust = "Tukey")
option3 <- emmeans::emmeans(aov, ~ G:N) %>%
multcomp::cld(adjust = "Tukey", details = TRUE)
# uniform format ----------------------------------------------------------
option1 <- option1 %>%
as_tibble(rownames = "contrast") %>%
transmute(contrast = contrast,
estimate = diff,
p.value = `p adj`)
option2 <- option2 %>%
as_tibble() %>%
dplyr::select(contrast, estimate, p.value)
option3 <- option3 %>%
pluck("comparisons") %>%
as_tibble() %>%
dplyr::select(contrast, estimate, p.value)
# compare -----------------------------------------------------------------
option1
#> # A tibble: 6 x 3
#> contrast estimate p.value
#> <chr> <dbl> <dbl>
#> 1 B:N1-A:N1 53.3 0.999
#> 2 A:N2-A:N1 1419. 0.0860
#> 3 B:N2-A:N1 1729. 0.0401
#> 4 A:N2-B:N1 1366 0.0984
#> 5 B:N2-B:N1 1676 0.0455
#> 6 B:N2-A:N2 310 0.910
option2
#> # A tibble: 6 x 3
#> contrast estimate p.value
#> <chr> <dbl> <dbl>
#> 1 A N1 - B N1 -53.3 0.999
#> 2 A N1 - A N2 -1419. 0.0860
#> 3 A N1 - B N2 -1729. 0.0401
#> 4 B N1 - A N2 -1366 0.0984
#> 5 B N1 - B N2 -1676 0.0455
#> 6 A N2 - B N2 -310. 0.910
option3
#> # A tibble: 6 x 3
#> contrast estimate p.value
#> <chr> <dbl> <dbl>
#> 1 B N1 - A N1 53.3 0.999
#> 2 A N2 - A N1 1419. 0.0860
#> 3 A N2 - B N1 1366 0.0984
#> 4 B N2 - A N1 1729. 0.0401
#> 5 B N2 - B N1 1676 0.0455
#> 6 B N2 - A N2 310. 0.910
tibble(
o1_p = option1$p.value,
o2_p = option2$p.value,
o3_p = option3$p.value
) %>% cor()
#> o1_p o2_p o3_p
#> o1_p 1.0000000 1.0000000 0.9967731
#> o2_p 1.0000000 1.0000000 0.9967731
#> o3_p 0.9967731 0.9967731 1.0000000
Created on 2021-12-02 by the reprex package (v2.0.1)
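Note that the correlation for o3_p is slightly below 1 only because option3 lists the contrasts in a different order (its rows 3 and 4 are swapped relative to option1 and option2); the p-values themselves match. Does this help? If not, can you try to create a reproducible example with your data so that we have all the information?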
Also, here is a chapter I wrote on using and interpreting the compact letter display.
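Since the question asks for letters built directly from the TukeyHSD output, here is one possible sketch using multcompView::multcompLetters() on the adjusted p-values. It uses the built-in warpbreaks data as a stand-in (an assumption, since the asker's main_trial data is not available), and it relies on the factor level names not containing a hyphen, because the contrast names are split on "-".

```r
library(multcompView)

# Stand-in model on built-in data (assumption: same structure as the asker's two-way ANOVA)
fit <- aov(breaks ~ wool * tension, data = warpbreaks)
tk  <- TukeyHSD(fit)

# Adjusted p-values for the interaction, named like "B:L-A:L"
p_adj <- tk[["wool:tension"]][, "p adj"]

# Groups that share a letter do not differ at the chosen threshold
cld_letters <- multcompLetters(p_adj, threshold = 0.05)$Letters

# Data frame that can be joined to group means for plotting
cld_df <- data.frame(group = names(cld_letters), letter = unname(cld_letters))
cld_df
```

The cld_df table could then be merged with the per-group means and drawn with geom_text() on top of the bars or points.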
I'm testing the accuracy of an imputation model using training and test datasets. The model uses a categorical variable. Unfortunately, when I randomly split the dataset and fit the model on the training set, I am unable to estimate a coefficient for some categories that are present in the test dataset. I would like to split the data while ensuring that all categories are present in both the training and test datasets. Is there an easy way to do this in R?
In the simulated data below, this would require the same sets of letters to be present in both datasets, so that I can test the accuracy of the model in the test dataset.
chars<-c("A","B","C","D")
complete_data<-data.frame(v1=rnorm(100,2,100), v2=rnorm(100,1,100), v3=sample(chars, 100, replace=TRUE))
In my dataset, the problem is a little trickier, as some of the categories are extremely scarce.
EDIT:
Thanks for the responses. I ended up looking up stratified sampling as Antimon suggested and came across the caret package which apparently works as well.
library(caret)
train.index <- createDataPartition(complete_data$v3, p = .7, list = FALSE)
train <- complete_data[ train.index,]
test <- complete_data[-train.index,]
This can be achieved quite simply.
library(tidyverse)
chars<-c("A","B","C","D")
complete_data <- tibble(v1=rnorm(100,2,100),
v2=rnorm(100,1,100),
v3=sample(chars, 100, replace=TRUE))
propCategory = function(data, category, prop){
category = enquo(category)
cat1 = data %>% pull(!!category)
unlist(sapply(as.list(unique(cat1)), function(x) {sample(which(cat1==x), sum(cat1==x)*prop)}))
}
complete_data %>% propCategory(v3, .2)
output
[1] 98 35 20 78 40 70 87 3 86 38 22 100 80 93 47 5 24 29 26
As you can see, my propCategory function returns row indexes. But let's check whether they contain what you need.
First, let's check the training indexes.
train = complete_data %>% propCategory(v3, .75)
complete_data[train,] %>% distinct(v3)
complete_data[train,] %>% nrow()
output
> complete_data[train,] %>% distinct(v3)
# A tibble: 4 x 1
v3
<chr>
1 B
2 A
3 D
4 C
> complete_data[train,] %>% nrow()
[1] 74
Now it's time for the test indexes.
complete_data[-train,] %>% distinct(v3)
complete_data[-train,] %>% nrow()
output
> complete_data[-train,] %>% distinct(v3)
# A tibble: 4 x 1
v3
<chr>
1 B
2 A
3 D
4 C
> complete_data[-train,] %>% nrow()
[1] 26
As you can see, both the training and test data include each of your categories.
A little note about the prop parameter.
My propCategory function is written so that, for each value of the categorical variable, it returns a random sample of row indexes whose size is prop times the number of rows with that value.
Take a good look at the results below.
complete_data %>% group_by(v3) %>%
summarise(n = n(), prop = n()/nrow(.))
complete_data[train,] %>% group_by(v3) %>%
summarise(n = n(), prop = n()/nrow(.))
complete_data[-train,] %>% group_by(v3) %>%
summarise(n = n(), prop = n()/nrow(.))
output
> complete_data %>% group_by(v3) %>%
+ summarise(n = n(), prop = n()/nrow(.))
# A tibble: 4 x 3
v3 n prop
<chr> <int> <dbl>
1 A 26 0.26
2 B 35 0.35
3 C 24 0.24
4 D 15 0.15
> complete_data[train,] %>% group_by(v3) %>%
+ summarise(n = n(), prop = n()/nrow(.))
# A tibble: 4 x 3
v3 n prop
<chr> <int> <dbl>
1 A 19 0.257
2 B 26 0.351
3 C 18 0.243
4 D 11 0.149
> complete_data[-train,] %>% group_by(v3) %>%
+ summarise(n = n(), prop = n()/nrow(.))
# A tibble: 4 x 3
v3 n prop
<chr> <int> <dbl>
1 A 7 0.269
2 B 9 0.346
3 C 6 0.231
4 D 4 0.154
There are several ways of doing this. You need to divide your data by v3 and then split each group randomly:
chars <- c("A","B","C","D")
complete_data <- data.frame(v1=rnorm(100,2,100), v2=rnorm(100,1,100), v3=sample(chars, 100, replace=TRUE))
Now we'll use the by() function to split the data into groups by v3 and draw a random sample of half of the rownames in each group:
test <- as.numeric(unlist(by(complete_data, complete_data$v3, function(x) sample(rownames(x), length(rownames(x))/2))))
train_test <- rep("train", nrow(complete_data))
train_test[test] <- "test"
table(complete_data$v3, train_test)
# train_test
# test train
# A 11 12
# B 12 13
# C 13 14
# D 12 13
Now complete_data[train_test=="train", ] is your training set and complete_data[train_test=="test", ] is your test set.
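For the scarce categories mentioned in the question, a purely proportional split can round a small group down to zero rows on one side. In addition, sample(x, n) treats a single number x as the range 1:x, which bites when a category has exactly one row. The sketch below guards against both; the function name stratified_train_index and the toy vector are made up for illustration.

```r
# Hypothetical helper: returns training row indexes, sampled within each level of f
stratified_train_index <- function(f, prop) {
  idx_by_level <- split(seq_along(f), f)
  picked <- lapply(idx_by_level, function(idx) {
    n <- length(idx)
    if (n < 2) stop("each level needs at least two rows to appear in both sets")
    # keep at least one row for training and at least one for testing
    n_train <- min(max(round(n * prop), 1), n - 1)
    # sample positions, then index into idx, so idx is never read as 1:idx
    idx[sample.int(n, n_train)]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(42)
# Made-up categorical variable with one scarce level ("D" has only 2 rows)
v3 <- c(rep("A", 40), rep("B", 35), rep("C", 23), rep("D", 2))

train_idx <- stratified_train_index(v3, prop = 0.75)
test_idx  <- setdiff(seq_along(v3), train_idx)

table(v3[train_idx])
table(v3[test_idx])
```

Levels with a single row cannot be in both sets, so the sketch stops with an error for them; whether to drop, merge, or duplicate such rows is a modelling decision.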
I am using a self-defined function that runs a regression analysis. I want to run it for thousands of companies over multiple years, so speed is essential. My function creates three outputs (a coefficient, the p-value, and R-squared). The function runs fine when called individually; however, when I use mutate() to run it over the whole dataset, it returns the same values for all rows. The weirdest thing is that I can't reproduce those particular values by running the function individually. I made a reproducible example below. I have used lapply successfully with this data before, but I would like to keep it in mutate, and above all I would like to know exactly what is happening here.
So my question is: how can I make this function work for each individual row for the companies dataset using mutate?
library(tidyverse)
companies <- data.frame(comp_id = 1:5)
individuals <- data.frame(id = 1:100,
comp_id = sample(1:5, 100, replace = T),
age = sample(18:67, 100, replace = T),
wage = sample(1700:10000, 100, replace = T))
regger <- function(x){
df <- individuals %>% filter(comp_id == x)
formula <- wage ~ age
regression <- lm(formula, df)
res <- list(coeff = summary(regression)$coefficient[2,1],
p = summary(regression)$coefficients[2,4],
r2 = summary(regression)$r.squared)
return(res)
}
companies %>%
mutate(data = list(regger(comp_id))) %>%
unnest_wider(data)
output:
# A tibble: 5 x 4
comp_id coeff p r2
<int> <dbl> <dbl> <dbl>
1 1 -4.92 0.916 0.000666
2 2 -4.92 0.916 0.000666
3 3 -4.92 0.916 0.000666
4 4 -4.92 0.916 0.000666
5 5 -4.92 0.916 0.000666
Use map from the purrr package if a function is not vectorized:
library(tidyverse)
set.seed(1337)
companies <- data.frame(comp_id = 1:5)
individuals <- data.frame(
id = 1:100,
comp_id = sample(1:5, 100, replace = T),
age = sample(18:67, 100, replace = T),
wage = sample(1700:10000, 100, replace = T)
)
regger <- function(x) {
df <- individuals %>% filter(comp_id == x)
formula <- wage ~ age
regression <- lm(formula, df)
res <- list(
coeff = summary(regression)$coefficient[2, 1],
p = summary(regression)$coefficients[2, 4],
r2 = summary(regression)$r.squared
)
return(res)
}
companies %>%
mutate(data = comp_id %>% map(regger)) %>%
unnest_wider(data)
#> # A tibble: 5 x 4
#> comp_id coeff p r2
#> <int> <dbl> <dbl> <dbl>
#> 1 1 67.1 0.108 0.218
#> 2 2 23.7 0.466 0.0337
#> 3 3 31.2 0.292 0.0462
#> 4 4 18.4 0.582 0.0134
#> 5 5 0.407 0.994 0.00000371
Created on 2021-09-09 by the reprex package (v2.0.1)
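As for why the original mutate() call returned the same values in every row: a likely explanation is that an ungrouped mutate() calls regger() once with the whole comp_id column, so filter(comp_id == x) compares the 100 values of individuals$comp_id against the recycled vector 1:5 instead of a single id. One regression is fitted on that arbitrary subset, and the length-1 list is then recycled to all rows. The base R sketch below only demonstrates the recycling step; the variable names are made up.

```r
set.seed(1337)
individuals <- data.frame(comp_id = sample(1:5, 100, replace = TRUE))

# What regger() receives as x inside an ungrouped mutate(): the whole column
x <- 1:5

# Elementwise comparison: x is silently recycled to length 100
recycled <- individuals$comp_id == x
by_hand  <- individuals$comp_id == rep(x, times = 20)
identical(recycled, by_hand)

# The rows kept this way are not the rows of any single company
sum(recycled)
sum(individuals$comp_id == 1)
```

This also explains why the values could not be reproduced by calling the function with a single id: the subset used matches no individual company.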
I'm not sure what the output should look like, but could it be that you need to work on a row-by-row basis?
companies %>%
rowwise() %>%
mutate(data = list(regger(comp_id))) %>%
unnest_wider(data)
comp_id coeff p r2
<int> <dbl> <dbl> <dbl>
1 1 21.6 0.470 0.0264
2 2 13.5 0.782 0.00390
3 3 0.593 0.984 0.0000175
4 4 -9.33 0.824 0.00394
5 5 64.9 0.145 0.156
I am trying to find the cluster centers in grouped data. Using the sample data set below, I am able to run kmeans clustering within each group. However, I don't know how to get the center of each cluster for a given group. https://rdrr.io/cran/broom/man/kmeans_tidiers.html
The sample data is borrowed, with a small modification to add a gr column.
Sample data
library(dplyr)
library(broom)
library(ggplot2)
set.seed(2015)
sizes_1 <- c(20, 100, 500)
sizes_2 <- c(10, 50, 100)
centers_1 <- data_frame(x = c(1, 4, 6),
y = c(5, 0, 6),
n = sizes_1,
cluster = factor(1:3))
centers_2 <- data_frame(x = c(1, 4, 6),
y = c(5, 0, 6),
n = sizes_2,
cluster = factor(1:3))
points1 <- centers_1 %>%
group_by(cluster) %>%
do(data_frame(x = rnorm(.$n, .$x),
y = rnorm(.$n, .$y),
gr="1"))
points2 <- centers_2 %>%
group_by(cluster) %>%
do(data_frame(x = rnorm(.$n, .$x),
y = rnorm(.$n, .$y),
gr="2"))
combined_points <- rbind(points1, points2)
> combined_points
# A tibble: 780 x 4
# Groups: cluster [3]
cluster x y gr
<fctr> <dbl> <dbl> <chr>
1 1 3.66473833 4.285771 1
2 1 0.51540619 5.565826 1
3 1 0.11556319 5.592178 1
4 1 1.60513712 5.360013 1
5 1 2.18001557 4.955883 1
6 1 1.53998887 4.530316 1
7 1 -1.44165622 4.561338 1
8 1 2.35076259 5.408538 1
9 1 -0.03060973 4.980363 1
10 1 2.22165205 5.125556 1
# ... with 770 more rows
ggplot(combined_points, aes(x, y)) +
facet_wrap(~gr) +
geom_point(aes(color = cluster))
OK, everything is fine up to here. The problem starts when I want to extract the cluster centers within each group:
clust <- combined_points %>%
group_by(gr) %>%
dplyr::select(x, y) %>%
kmeans(3)
> clust
K-means clustering with 3 clusters of sizes 594, 150, 36
Cluster means:
gr x y
1 1.166667 6.080832 6.0074885
2 1.333333 4.055645 0.0654158
3 1.305556 1.507862 5.2417670
As we can see, the gr values have changed, and I don't know which group these centers belong to.
Going one step further to look at the tidy format of clust:
> tidy(clust)
x1 x2 x3 size withinss cluster
1 1.166667 6.080832 6.0074885 594 1095.3047 1
2 1.333333 4.055645 0.0654158 150 312.4182 2
3 1.305556 1.507862 5.2417670 36 115.2484 3
I still can't see the center information for gr 2.
I hope the problem is explained clearly. Let me know if anything is missing! Thanks in advance!
kmeans doesn't understand dplyr grouping, so it's just finding three overall centers instead of three within each group. The preferred idiom for this at the moment is to put the input data in list columns, e.g.
library(tidyverse)
points_and_models <- combined_points %>%
ungroup() %>% select(-cluster) %>% # cleanup, remove cluster name so data will collapse
nest(x, y) %>% # collapse input data into list column
mutate(model = map(data, kmeans, 3), # iterate model over list column of input data
centers = map(model, broom::tidy)) # extract data from models
points_and_models
#> # A tibble: 2 x 4
#> gr data model centers
#> <chr> <list> <list> <list>
#> 1 1 <tibble [620 × 2]> <S3: kmeans> <data.frame [3 × 5]>
#> 2 2 <tibble [160 × 2]> <S3: kmeans> <data.frame [3 × 5]>
points_and_models %>% unnest(centers)
#> # A tibble: 6 x 6
#> gr x1 x2 size withinss cluster
#> <chr> <dbl> <dbl> <int> <dbl> <fct>
#> 1 1 4.29 5.71 158 441. 1
#> 2 1 3.79 0.121 102 213. 2
#> 3 1 6.39 6.06 360 534. 3
#> 4 2 5.94 5.88 100 194. 1
#> 5 2 4.01 -0.127 50 97.4 2
#> 6 2 1.07 4.57 10 15.7 3
Note that the cluster column is from the model results, not the input data.
You can also do the same thing with do, e.g.
combined_points %>%
group_by(gr) %>%
do(model = kmeans(.[c('x', 'y')], 3)) %>%
ungroup() %>% group_by(gr) %>%
do(map_df(.$model, broom::tidy)) %>% ungroup()
but do and grouping rowwise are sort of soft-deprecated at this point, and the code gets a little janky, as you can see by the need to explicitly ungroup so much.
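If you would rather avoid both nest and do, the same per-group fit can be sketched in base R with split() and lapply(). The data below are made up to resemble the question's setup (two groups, three blobs each), and the helper make_group is hypothetical.

```r
set.seed(2015)

# Hypothetical helper: three blobs of n_per_blob points each, tagged with a group label
make_group <- function(n_per_blob, gr) {
  data.frame(
    x  = rnorm(3 * n_per_blob, mean = rep(c(1, 4, 6), each = n_per_blob)),
    y  = rnorm(3 * n_per_blob, mean = rep(c(5, 0, 6), each = n_per_blob)),
    gr = gr
  )
}
pts <- rbind(make_group(50, "1"), make_group(20, "2"))

# Fit kmeans separately on the x/y columns of each group
models <- lapply(split(pts[c("x", "y")], pts$gr), kmeans, centers = 3)

# Stack the centers, tagging each row with its group and cluster number
centers <- do.call(rbind, lapply(names(models), function(g) {
  data.frame(gr = g, cluster = 1:3, models[[g]]$centers, row.names = NULL)
}))
centers
```

As in the list-column version, the cluster numbers come from each model and are not comparable across groups.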
I wanted to evaluate the performance of several regression models and used the yardstick package to calculate the RMSE. Here is some example data:
model obs pred
1 A 1 1
2 B 1 2
3 C 1 3
When I run the following code
library(yardstick)
library(dplyr)
dat %>%
group_by(model) %>%
summarise(RMSE = yardstick::rmse(truth = obs, estimate = pred))
I get the following error
Error in summarise_impl(.data, dots) :
no applicable method for 'rmse' applied to an object of class "c('double', 'numeric')".
However, when I explicitly supply . as the first argument (which should not be necessary, I thought), I get no error, but the results are incorrect.
dat %>%
group_by(model) %>%
summarise(RMSE = yardstick::rmse(., truth = obs, estimate = pred))
# A tibble: 3 x 2
model RMSE
<fctr> <dbl>
1 A 1.29
2 B 1.29
3 C 1.29
I was expecting the following
# A tibble: 3 x 2
model RMSE
<fctr> <dbl>
1 A 0
2 B 1.00
3 C 2.00
I know that there are alternatives to this function but still I don't understand this behavior.
data
dat <- structure(list(model = structure(1:3, .Label = c("A", "B", "C"), class = "factor"), obs = c(1, 1, 1), pred = 1:3), .Names = c("model", "obs", "pred"), row.names = c(NA, -3L), class = "data.frame")
Based on the help page ?yardstick::rmse, it looks like it expects a data frame as its first argument, which explains the error you're getting.
I'm not quite up to speed on that new package, but it seems that the function expects to calculate a summary statistic across a whole data frame rather than work on the vectors that summarise() hands it. To run it group by group, you need to make each group its own data frame (here each group happens to be a single row) and apply the function within each of those data frames:
library(tidyverse)
dat %>%
group_by(model) %>%
nest() %>%
mutate(rmse_res = map(data, rmse, truth = obs, estimate = pred)) %>%
unnest(rmse_res)
# A tibble: 3 x 3
model data rmse
<fctr> <list> <dbl>
1 A <tibble [1 x 2]> 0
2 B <tibble [1 x 2]> 1.00
3 C <tibble [1 x 2]> 2.00
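The 1.29 in the question can be checked by hand: the `.` inside summarise() is the whole piped data frame, so every group receives the RMSE over all three rows, sqrt((0 + 1 + 4) / 3). The sketch below reproduces that arithmetic in base R; rmse_manual is a made-up helper, not part of yardstick.

```r
dat <- data.frame(model = c("A", "B", "C"), obs = c(1, 1, 1), pred = 1:3)

# Hypothetical helper, written out so the arithmetic is visible
rmse_manual <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Whole data frame, which is what `.` refers to inside summarise()
overall <- rmse_manual(dat$obs, dat$pred)  # sqrt((0 + 1 + 4) / 3)

# One value per model, which is what was expected
per_model <- sapply(split(dat, dat$model), function(d) rmse_manual(d$obs, d$pred))

overall
per_model
```

Here overall rounds to 1.29 and per_model gives 0, 1, and 2 for A, B, and C, matching the incorrect and the expected tables in the question.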
We can use the do function to apply the rmse function to every group.
dat %>%
group_by(model) %>%
do(data_frame(model = .$model[1], obs = .$obs[1], pred = .$pred[1],
RMSE = yardstick::rmse(., truth = obs, estimate = pred)))
# # A tibble: 3 x 4
# # Groups: model [3]
# model obs pred RMSE
# <fctr> <dbl> <int> <dbl>
# 1 A 1.00 1 0
# 2 B 1.00 2 1.00
# 3 C 1.00 3 2.00
Or we can split the data frame and apply the rmse function.
dat %>%
mutate(RMSE = dat %>%
split(.$model) %>%
sapply(yardstick::rmse, truth = obs, estimate = pred))
# model obs pred RMSE
# 1 A 1 1 0
# 2 B 1 2 1
# 3 C 1 3 2
Or we can nest the obs and pred column to a list column and then apply the rmse function.
library(tidyr)
dat %>%
nest(obs, pred) %>%
mutate(RMSE = sapply(data, yardstick::rmse, truth = obs, estimate = pred)) %>%
unnest()
# model RMSE obs pred
# 1 A 0 1 1
# 2 B 1 1 2
# 3 C 2 1 3
The outputs of these three methods differ slightly, but all contain the correct RMSE values. Here I use the microbenchmark package to compare their performance.
library(microbenchmark)
microbenchmark(m1 = {dat %>%
group_by(model) %>%
do(data_frame(model = .$model[1], obs = .$obs[1], pred = .$pred[1],
RMSE = yardstick::rmse(., truth = obs, estimate = pred)))},
m2 = {dat %>%
mutate(RMSE = dat %>%
split(.$model) %>%
sapply(yardstick::rmse, truth = obs, estimate = pred))},
m3 = {dat %>%
nest(obs, pred) %>%
mutate(RMSE = sapply(data, yardstick::rmse, truth = obs, estimate = pred)) %>%
unnest()})
# Unit: milliseconds
# expr min lq mean median uq max neval
# m1 43.18746 46.71055 50.23383 48.46554 51.05639 174.46371 100
# m2 14.08516 14.78093 16.14605 15.74505 16.89936 24.02136 100
# m3 28.99795 30.90407 32.71092 31.89954 33.94729 44.57953 100
The result shows that m2 is the fastest, while m1 is the slowest. I think the implication is that the do operation is usually slower than other methods, so we should avoid it where possible. Although m2 is the fastest, I personally like the syntax of m3 best. The nested data frame makes it easy to summarize information across different models or groups.