Calculating upper and lower confidence intervals by group in dplyr summarise() - r

I am trying to make a table that shows N (number of observations), percent frequency (of answers > 0), and the lower and upper confidence intervals for percent frequency, and I want to group this by type.
Example of data
dat <- data.frame(
"type" = c("B","B","A","B","A","A","B","A","A","B","A","A","A","B","B","B"),
"num" = c(3,0,0,9,6,0,4,1,1,5,6,1,3,0,0,0)
)
Expected output (with values filled in):
Type N Percent Lower 95% CI Upper 95% CI
A
B
Attempt
library(dplyr)
library(qwraps2)
table<-dat %>%
group_by(type) %>%
summarise(N=n(),
mean.ci = mean_ci(dat$num),
"Percent"=n_perc(num > 0))
This worked to get N and percent frequency, but returned an error: "Column must be length 1 (a summary value), not 3" when I added in mean_ci
The second code I tried, found here:
table2<-dat %>%
group_by(type) %>%
summarise(N.num=n(),
mean.num = mean(dat$num),
sd.num = sd(dat$num),
"Percent"=n_perc(num > 0)) %>%
mutate(se.num = sd.num / sqrt(N.num),
lower.ci = 100*(mean.num - qt(1 - (0.05 / 2), N.num - 1) * se.num),
upper.ci = 100*(mean.num + qt(1 - (0.05 / 2), N.num - 1) * se.num))
# A tibble: 2 x 8
# type N.num mean.num sd.num Percent se.num lower.ci upper.ci
# <fct> <int> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
#1 A 8 2.44 2.83 "6 (75.00\\%)" 1.00 7.35 480.
#2 B 8 2.44 2.83 "4 (50.00\\%)" 1.00 7.35 480.
This gave me an output, but the confidence intervals are not logical.

The output of mean_ci is a vector of length 3. This is maybe unexpected because the package has added a print method so that when you see this in the console it looks like a single character value and not a numeric length > 1 vector. But, you can see the underlying data structure by looking at str.
mean_ci(dat$num) %>% str
# 'qwraps2_mean_ci' Named num [1:3] 2.44 1.05 3.82
# - attr(*, "names")= chr [1:3] "mean" "lcl" "ucl"
# - attr(*, "alpha")= num 0.05
In summarize, each element of each column of the output needs to be length 1, so providing a length 3 object for summarize to put in a single "cell" (column element) results in an error. A workaround is to put the length 3 vector in a list, so that it is now a length 1 list. Then you can use unnest_wider to separate it into 3 columns (and therefore making the table "wider")
library(tidyverse)
dat %>%
group_by(type) %>%
summarise( N=n(),
mean.ci = list(mean_ci(num)),
"Percent"= n_perc(num > 0)) %>%
unnest_wider(mean.ci)
# # A tibble: 2 x 6
# type N mean lcl ucl Percent
# <fct> <int> <dbl> <dbl> <dbl> <chr>
# 1 A 8 2.25 0.523 3.98 "6 (75.00\\%)"
# 2 B 8 2.62 0.344 4.91 "4 (50.00\\%)"

IceCreamToucan’s answer is very good. I’m posting this answer to offer a
different way to present the information.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(qwraps2)
dat <- data.frame("type" = c("B","B","A","B","A","A","B","A","A","B","A","A","A","B","B","B"),
"num" = c(3,0,0,9,6,0,4,1,1,5,6,1,3,0,0,0))
When building the dplyr::summarize call you can use the qwraps2::frmtci
call to format the output of qwraps2::mean_ci into a character string of
length one.
I would also recommend using the data pronoun .data so you can be explicit
about the variables to summarize.
dat %>%
dplyr::group_by(type) %>%
dplyr::summarize(N = n(),
mean.ci = qwraps2::frmtci(qwraps2::mean_ci(.data$num)),
Percent = qwraps2::n_perc(.data$num > 0))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 4
#> type N mean.ci Percent
#> <chr> <int> <chr> <chr>
#> 1 A 8 2.25 (0.52, 3.98) "6 (75.00\\%)"
#> 2 B 8 2.62 (0.34, 4.91) "4 (50.00\\%)"
Created on 2020-09-15 by the reprex package (v0.3.0)
devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.0.2 (2020-06-22)
#> os macOS Catalina 10.15.6
#> system x86_64, darwin17.0
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz America/Denver
#> date 2020-09-15
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.0)
#> backports 1.1.9 2020-08-24 [1] CRAN (R 4.0.2)
#> callr 3.4.4 2020-09-07 [1] CRAN (R 4.0.2)
#> cli 2.0.2 2020-02-28 [1] CRAN (R 4.0.0)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 4.0.0)
#> desc 1.2.0 2018-05-01 [1] CRAN (R 4.0.0)
#> devtools 2.3.1 2020-07-21 [1] CRAN (R 4.0.2)
#> digest 0.6.25 2020-02-23 [1] CRAN (R 4.0.0)
#> dplyr * 1.0.2 2020-08-18 [1] CRAN (R 4.0.2)
#> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.0)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.0)
#> fansi 0.4.1 2020-01-08 [1] CRAN (R 4.0.0)
#> fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2)
#> generics 0.0.2 2018-11-29 [1] CRAN (R 4.0.0)
#> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
#> highr 0.8 2019-03-20 [1] CRAN (R 4.0.0)
#> htmltools 0.5.0 2020-06-16 [1] CRAN (R 4.0.0)
#> knitr 1.29 2020-06-23 [1] CRAN (R 4.0.0)
#> lifecycle 0.2.0 2020-03-06 [1] CRAN (R 4.0.0)
#> magrittr 1.5 2014-11-22 [1] CRAN (R 4.0.0)
#> memoise 1.1.0 2017-04-21 [1] CRAN (R 4.0.0)
#> pillar 1.4.6 2020-07-10 [1] CRAN (R 4.0.2)
#> pkgbuild 1.1.0 2020-07-13 [1] CRAN (R 4.0.2)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.0)
#> pkgload 1.1.0 2020-05-29 [1] CRAN (R 4.0.0)
#> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.0.0)
#> processx 3.4.4 2020-09-03 [1] CRAN (R 4.0.2)
#> ps 1.3.4 2020-08-11 [1] CRAN (R 4.0.2)
#> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.0)
#> qwraps2 * 0.5.0 2020-09-14 [1] local
#> R6 2.4.1 2019-11-12 [1] CRAN (R 4.0.0)
#> Rcpp 1.0.5 2020-07-06 [1] CRAN (R 4.0.0)
#> remotes 2.2.0 2020-07-21 [1] CRAN (R 4.0.2)
#> rlang 0.4.7 2020-07-09 [1] CRAN (R 4.0.2)
#> rmarkdown 2.3 2020-06-18 [1] CRAN (R 4.0.0)
#> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 4.0.0)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.0)
#> stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.0)
#> testthat 2.3.2 2020-03-02 [1] CRAN (R 4.0.0)
#> tibble 3.0.3 2020-07-10 [1] CRAN (R 4.0.2)
#> tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.0)
#> usethis 1.6.1 2020-04-29 [1] CRAN (R 4.0.0)
#> utf8 1.1.4 2018-05-24 [1] CRAN (R 4.0.0)
#> vctrs 0.3.4 2020-08-29 [1] CRAN (R 4.0.2)
#> withr 2.2.0 2020-04-20 [1] CRAN (R 4.0.0)
#> xfun 0.17 2020-09-09 [1] CRAN (R 4.0.2)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.0)
#>
#> [1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library

Related

tidymodel recipe and `step_lag()`: Error when using `predict()`

This may be a usage misunderstanding, but I expect the following toy example to work. I want to have a lagged predictor in my recipe, but once I include it in the recipe, and try to predict on the same data using a workflow with the recipe, it doesn't recognize the column foo and cannot compute its lag.
Now, I can get this to work if I:
Pull the fit out of the workflow that has been fit.
Independently prep and bake the data I want to fit.
Which I code after the failed workflow fit, and it succeeds. According to the documentation, I should be able to put a workflow fit in the predict slot: https://www.tidymodels.org/start/recipes/#predict-workflow
I am probably fundamentally misunderstanding how workflow is supposed to operate. I have what I consider a workaround, but I do not understand why the failed statement isn't working in the way the workaround is. I expected the failed workflow construct to work under the covers like the workaround I have.
In short, if work_df is a dataframe, the_rec is a recipe based off work_df, rf_mod is a model, and you create the workflow rf_workflow, then should I expect the predict() function to work identically in the two predict() calls below?
## Workflow
rf_workflow <-
workflow() %>%
add_model(rf_mod) %>%
add_recipe(the_rec)
## fit
rf_workflow_fit <-
rf_workflow %>%
fit(data = work_df)
## Predict with workflow. I expect since a workflow has a fit model and
## a recipe as part of it, it should know how to do the following:
predict(rf_workflow_fit, work_df)
#> Error: Problem with `mutate()` input `lag_1_foo`.
#> x object 'foo' not found
#> i Input `lag_1_foo` is `dplyr::lag(x = foo, n = 1L, default = NA)`.
## Predict by explicitly prepping and baking the data, and pulling out the
## fit from the workflow:
predict(
rf_workflow_fit %>%
pull_workflow_fit(),
bake(prep(the_rec), work_df))
#> # A tibble: 995 x 1
#> .pred
#> <dbl>
#> 1 2.24
#> 2 0.595
#> 3 0.262
Full reprex example below.
library(tidymodels)
#> -- Attaching packages -------------------------------------------------------------------------------------- tidymodels 0.1.1 --
#> v broom 0.7.1 v recipes 0.1.13
#> v dials 0.0.9 v rsample 0.0.8
#> v dplyr 1.0.2 v tibble 3.0.3
#> v ggplot2 3.3.2 v tidyr 1.1.2
#> v infer 0.5.3 v tune 0.1.1
#> v modeldata 0.0.2 v workflows 0.2.1
#> v parsnip 0.1.3 v yardstick 0.0.7
#> v purrr 0.3.4
#> -- Conflicts ----------------------------------------------------------------------------------------- tidymodels_conflicts() --
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
#> x recipes::step() masks stats::step()
library(dplyr)
set.seed(123)
### Create autocorrelated timeseries: https://stafoo.stackexchange.com/a/29242/17203
work_df <-
tibble(
foo = stats::filter(rnorm(1000), filter=rep(1,5), circular=TRUE) %>%
as.numeric()
)
# plot(work_df$foo)
work_df
#> # A tibble: 1,000 x 1
#> foo
#> <dbl>
#> 1 -0.00375
#> 2 0.589
#> 3 0.968
#> 4 3.24
#> 5 3.93
#> 6 1.11
#> 7 0.353
#> 8 -0.222
#> 9 -0.713
#> 10 -0.814
#> # ... with 990 more rows
## Recipe
the_rec <-
recipe(foo ~ ., data = work_df) %>%
step_lag(foo, lag=1:5) %>%
step_naomit(all_predictors())
the_rec %>% prep() %>% juice()
#> # A tibble: 995 x 6
#> foo lag_1_foo lag_2_foo lag_3_foo lag_4_foo lag_5_foo
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1.11 3.93 3.24 0.968 0.589 -0.00375
#> 2 0.353 1.11 3.93 3.24 0.968 0.589
#> 3 -0.222 0.353 1.11 3.93 3.24 0.968
#> 4 -0.713 -0.222 0.353 1.11 3.93 3.24
#> 5 -0.814 -0.713 -0.222 0.353 1.11 3.93
#> 6 0.852 -0.814 -0.713 -0.222 0.353 1.11
#> 7 1.65 0.852 -0.814 -0.713 -0.222 0.353
#> 8 1.54 1.65 0.852 -0.814 -0.713 -0.222
#> 9 2.10 1.54 1.65 0.852 -0.814 -0.713
#> 10 2.24 2.10 1.54 1.65 0.852 -0.814
#> # ... with 985 more rows
## Model
rf_mod <-
rand_forest(
mtry = 4,
trees = 1000,
min_n = 13) %>%
set_mode("regression") %>%
set_engine("ranger")
## Workflow
rf_workflow <-
workflow() %>%
add_model(rf_mod) %>%
add_recipe(the_rec)
## fit
rf_workflow_fit <-
rf_workflow %>%
fit(data = work_df)
## Predict
predict(rf_workflow_fit, work_df)
#> Error: Problem with `mutate()` input `lag_1_foo`.
#> x object 'foo' not found
#> i Input `lag_1_foo` is `dplyr::lag(x = foo, n = 1L, default = NA)`.
## Perhaps I just need to pull off the fit and work with that?... Nope.
predict(
rf_workflow_fit %>%
pull_workflow_fit(),
work_df)
#> Error: Can't subset columns that don't exist.
#> x Columns `lag_1_foo`, `lag_2_foo`, `lag_3_foo`, `lag_4_foo`, and `lag_5_foo` don't exist.
## Maybe I need to bake it first... and that works.
## But doesn't that defeat the purpose of a workflow?
predict(
rf_workflow_fit %>%
pull_workflow_fit(),
bake(prep(the_rec), work_df))
#> # A tibble: 995 x 1
#> .pred
#> <dbl>
#> 1 2.24
#> 2 0.595
#> 3 0.262
#> 4 -0.977
#> 5 -1.24
#> 6 -0.140
#> 7 1.36
#> 8 1.30
#> 9 1.78
#> 10 2.42
#> # ... with 985 more rows
## Session info
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#> setting value
#> version R version 3.6.3 (2020-02-29)
#> os Windows 10 x64
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate English_United States.1252
#> ctype English_United States.1252
#> tz America/Chicago
#> date 2020-10-13
#>
#> - Packages -------------------------------------------------------------------
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.3)
#> backports 1.1.10 2020-09-15 [1] CRAN (R 3.6.3)
#> broom * 0.7.1 2020-10-02 [1] CRAN (R 3.6.3)
#> class 7.3-15 2019-01-01 [1] CRAN (R 3.6.3)
#> cli 2.0.2 2020-02-28 [1] CRAN (R 3.6.3)
#> codetools 0.2-16 2018-12-24 [1] CRAN (R 3.6.3)
#> colorspace 1.4-1 2019-03-18 [1] CRAN (R 3.6.3)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.3)
#> dials * 0.0.9 2020-09-16 [1] CRAN (R 3.6.3)
#> DiceDesign 1.8-1 2019-07-31 [1] CRAN (R 3.6.3)
#> digest 0.6.25 2020-02-23 [1] CRAN (R 3.6.3)
#> dplyr * 1.0.2 2020-08-18 [1] CRAN (R 3.6.3)
#> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 3.6.3)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.3)
#> fansi 0.4.1 2020-01-08 [1] CRAN (R 3.6.3)
#> foreach 1.5.0 2020-03-30 [1] CRAN (R 3.6.3)
#> furrr 0.1.0 2018-05-16 [1] CRAN (R 3.6.3)
#> future 1.19.1 2020-09-22 [1] CRAN (R 3.6.3)
#> generics 0.0.2 2018-11-29 [1] CRAN (R 3.6.3)
#> ggplot2 * 3.3.2 2020-06-19 [1] CRAN (R 3.6.3)
#> globals 0.13.0 2020-09-17 [1] CRAN (R 3.6.3)
#> glue 1.4.2 2020-08-27 [1] CRAN (R 3.6.3)
#> gower 0.2.2 2020-06-23 [1] CRAN (R 3.6.3)
#> GPfit 1.0-8 2019-02-08 [1] CRAN (R 3.6.3)
#> gtable 0.3.0 2019-03-25 [1] CRAN (R 3.6.3)
#> hardhat 0.1.4 2020-07-02 [1] CRAN (R 3.6.3)
#> highr 0.8 2019-03-20 [1] CRAN (R 3.6.3)
#> htmltools 0.5.0 2020-06-16 [1] CRAN (R 3.6.3)
#> infer * 0.5.3 2020-07-14 [1] CRAN (R 3.6.3)
#> ipred 0.9-9 2019-04-28 [1] CRAN (R 3.6.3)
#> iterators 1.0.12 2019-07-26 [1] CRAN (R 3.6.3)
#> knitr 1.30 2020-09-22 [1] CRAN (R 3.6.3)
#> lattice 0.20-38 2018-11-04 [1] CRAN (R 3.6.3)
#> lava 1.6.8 2020-09-26 [1] CRAN (R 3.6.3)
#> lhs 1.1.1 2020-10-05 [1] CRAN (R 3.6.3)
#> lifecycle 0.2.0 2020-03-06 [1] CRAN (R 3.6.3)
#> listenv 0.8.0 2019-12-05 [1] CRAN (R 3.6.3)
#> lubridate 1.7.9 2020-06-08 [1] CRAN (R 3.6.3)
#> magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.3)
#> MASS 7.3-51.5 2019-12-20 [1] CRAN (R 3.6.3)
#> Matrix 1.2-18 2019-11-27 [1] CRAN (R 3.6.3)
#> modeldata * 0.0.2 2020-06-22 [1] CRAN (R 3.6.3)
#> munsell 0.5.0 2018-06-12 [1] CRAN (R 3.6.3)
#> nnet 7.3-12 2016-02-02 [1] CRAN (R 3.6.3)
#> parsnip * 0.1.3 2020-08-04 [1] CRAN (R 3.6.3)
#> pillar 1.4.6 2020-07-10 [1] CRAN (R 3.6.3)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 3.6.3)
#> plyr 1.8.6 2020-03-03 [1] CRAN (R 3.6.3)
#> pROC 1.16.2 2020-03-19 [1] CRAN (R 3.6.3)
#> prodlim 2019.11.13 2019-11-17 [1] CRAN (R 3.6.3)
#> purrr * 0.3.4 2020-04-17 [1] CRAN (R 3.6.3)
#> R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.3)
#> ranger 0.12.1 2020-01-10 [1] CRAN (R 3.6.3)
#> Rcpp 1.0.5 2020-07-06 [1] CRAN (R 3.6.3)
#> recipes * 0.1.13 2020-06-23 [1] CRAN (R 3.6.3)
#> rlang 0.4.7 2020-07-09 [1] CRAN (R 3.6.3)
#> rmarkdown 2.4 2020-09-30 [1] CRAN (R 3.6.3)
#> rpart 4.1-15 2019-04-12 [1] CRAN (R 3.6.3)
#> rsample * 0.0.8 2020-09-23 [1] CRAN (R 3.6.3)
#> rstudioapi 0.11 2020-02-07 [1] CRAN (R 3.6.3)
#> scales * 1.1.1 2020-05-11 [1] CRAN (R 3.6.3)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.3)
#> stringi 1.5.3 2020-09-09 [1] CRAN (R 3.6.3)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 3.6.3)
#> survival 3.1-8 2019-12-03 [1] CRAN (R 3.6.3)
#> tibble * 3.0.3 2020-07-10 [1] CRAN (R 3.6.3)
#> tidymodels * 0.1.1 2020-07-14 [1] CRAN (R 3.6.3)
#> tidyr * 1.1.2 2020-08-27 [1] CRAN (R 3.6.3)
#> tidyselect 1.1.0 2020-05-11 [1] CRAN (R 3.6.3)
#> timeDate 3043.102 2018-02-21 [1] CRAN (R 3.6.3)
#> tune * 0.1.1 2020-07-08 [1] CRAN (R 3.6.3)
#> utf8 1.1.4 2018-05-24 [1] CRAN (R 3.6.3)
#> vctrs 0.3.4 2020-08-29 [1] CRAN (R 3.6.3)
#> withr 2.3.0 2020-09-22 [1] CRAN (R 3.6.3)
#> workflows * 0.2.1 2020-10-08 [1] CRAN (R 3.6.3)
#> xfun 0.18 2020-09-29 [1] CRAN (R 3.6.3)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 3.6.3)
#> yardstick * 0.0.7 2020-07-13 [1] CRAN (R 3.6.3)
#>
#> [1] C:/Users/IRINZN/Documents/R/R-3.6.3/library
Created on 2020-10-13 by the reprex package (v0.3.0)
The reason you are experiencing an error is that you have created a predictor variable from the outcome. When it comes time to predict on new data, the outcome is not available; we are predicting the outcome for new data, not assuming that it is there already.
This is a fairly strong assumption of the tidymodels framework, for either modeling or preprocessing, to protect against information leakage. You can read about this a bit more here.
It's possible you already know about these resources, but if you are working with time series models, I'd suggest checking out these resources:
Resampling for time series
Using timetk for time series preprocessing
Using modeltime for time series modeling

summarise() in dplyr 1.0.2 acting like mutate()

Given a tibble that lists users, products, and product features, I am attempting to calculate the fraction of distinct product users who have a certain product feature:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- tribble(
~users, ~product, ~feature,
"bob","iPhone","screen",
"bob","iPhone","camera",
"bob","iPhone","facial recognition",
"sally","Android","screen",
"sally","Android","camera",
"sally","Android","facial recognition",
"joe","Huawei","screen",
"joe","Huawei","camera",
"joe","Huawei","facial recognition",
"rachel","iPhone","screen",
"rachel","iPhone","camera",
"rachel","iPhone","fingerprint sensor"
)
# Get count of distinct users by product
df <- df %>%
group_by(product) %>%
mutate(n_users = n_distinct(users)) %>%
ungroup()
df
#> # A tibble: 12 x 4
#> users product feature n_users
#> <chr> <chr> <chr> <int>
#> 1 bob iPhone screen 2
#> 2 bob iPhone camera 2
#> 3 bob iPhone facial recognition 2
#> 4 sally Android screen 1
#> 5 sally Android camera 1
#> 6 sally Android facial recognition 1
#> 7 joe Huawei screen 1
#> 8 joe Huawei camera 1
#> 9 joe Huawei facial recognition 1
#> 10 rachel iPhone screen 2
#> 11 rachel iPhone camera 2
#> 12 rachel iPhone fingerprint sensor 2
# Count the fraction of distinct users with given product feature
df <- df %>%
group_by(product, feature) %>%
summarise(feature_fraction = n()/n_users,
.groups = "drop_last")
df
#> # A tibble: 12 x 3
#> # Groups: product [3]
#> product feature feature_fraction
#> <chr> <chr> <dbl>
#> 1 Android camera 1
#> 2 Android facial recognition 1
#> 3 Android screen 1
#> 4 Huawei camera 1
#> 5 Huawei facial recognition 1
#> 6 Huawei screen 1
#> 7 iPhone camera 1
#> 8 iPhone camera 1
#> 9 iPhone facial recognition 0.5
#> 10 iPhone fingerprint sensor 0.5
#> 11 iPhone screen 1
#> 12 iPhone screen 1
Created on 2020-10-23 by the reprex package (v0.3.0)
devtools::session_info()
#> - Session info ---------------------------------------------------------------
#> setting value
#> version R version 4.0.2 (2020-06-22)
#> os Windows 10 x64
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate English_United States.1252
#> ctype English_United States.1252
#> tz America/New_York
#> date 2020-10-23
#>
#> - Packages -------------------------------------------------------------------
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.2)
#> backports 1.1.10 2020-09-15 [1] CRAN (R 4.0.2)
#> callr 3.4.4 2020-09-07 [1] CRAN (R 4.0.2)
#> cli 2.0.2 2020-02-28 [1] CRAN (R 4.0.2)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 4.0.2)
#> desc 1.2.0 2018-05-01 [1] CRAN (R 4.0.2)
#> devtools 2.3.1 2020-07-21 [1] CRAN (R 4.0.2)
#> digest 0.6.25 2020-02-23 [1] CRAN (R 4.0.2)
#> dplyr * 1.0.2 2020-08-18 [1] CRAN (R 4.0.2)
#> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.2)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.2)
#> fansi 0.4.1 2020-01-08 [1] CRAN (R 4.0.2)
#> fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2)
#> generics 0.0.2 2018-11-29 [1] CRAN (R 4.0.2)
#> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
#> highr 0.8 2019-03-20 [1] CRAN (R 4.0.2)
#> htmltools 0.5.0 2020-06-16 [1] CRAN (R 4.0.2)
#> knitr 1.29 2020-06-23 [1] CRAN (R 4.0.2)
#> lifecycle 0.2.0 2020-03-06 [1] CRAN (R 4.0.2)
#> magrittr 1.5 2014-11-22 [1] CRAN (R 4.0.2)
#> memoise 1.1.0 2017-04-21 [1] CRAN (R 4.0.2)
#> pillar 1.4.6 2020-07-10 [1] CRAN (R 4.0.2)
#> pkgbuild 1.1.0 2020-07-13 [1] CRAN (R 4.0.2)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.2)
#> pkgload 1.1.0 2020-05-29 [1] CRAN (R 4.0.2)
#> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.0.2)
#> processx 3.4.4 2020-09-03 [1] CRAN (R 4.0.2)
#> ps 1.3.4 2020-08-11 [1] CRAN (R 4.0.2)
#> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.2)
#> R6 2.4.1 2019-11-12 [1] CRAN (R 4.0.2)
#> remotes 2.2.0 2020-07-21 [1] CRAN (R 4.0.2)
#> rlang 0.4.7 2020-07-09 [1] CRAN (R 4.0.2)
#> rmarkdown 2.3 2020-06-18 [1] CRAN (R 4.0.2)
#> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 4.0.2)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.2)
#> stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.2)
#> testthat 2.3.2 2020-03-02 [1] CRAN (R 4.0.2)
#> tibble 3.0.3 2020-07-10 [1] CRAN (R 4.0.2)
#> tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.2)
#> usethis 1.6.1 2020-04-29 [1] CRAN (R 4.0.2)
#> utf8 1.1.4 2018-05-24 [1] CRAN (R 4.0.2)
#> vctrs 0.3.4 2020-08-29 [1] CRAN (R 4.0.2)
#> withr 2.3.0 2020-09-22 [1] CRAN (R 4.0.2)
#> xfun 0.16 2020-07-24 [1] CRAN (R 4.0.2)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.2)
As can be seen, the final tibble has multiple rows for group-key pairs with the same summary value. This is, to my knowledge, unexpected behavior for summarise and seems almost the same as what mutate would return. Given this open github issue, it appears that maybe all the kinks haven't been ironed out of the new version of summarise. I also could just be being stupid, and would appreciate if someone could help get me back on track!
The problem is your have multiple values for n_users for each group. The latest version of dplyr allow you to return more than one row per group if your summary function returns multiple values.
If you want to assume all the values for n_users will be the same per group, then you can do
df %>%
group_by(product, feature) %>%
summarise(feature_fraction = n()/first(n_users),
.groups = "drop_last")
That will make sure only one value is returned per group

Tidymodels tune_grid: "Can't subset columns that don't exist" when not using formula

I've put together a data preprocessing recipe for the recent coffee dataset featured on TidyTuesday. My intention is to generate a workflow, and then from there tune a hyperparameter. I'm specifically interesting in manually declaring predictors and outcomes through the various update_role() functions, rather than using a formula, since I have some great plans for this style of variable selection (it's a really great idea!).
The example below produces a recipe that works just fine with prep and bake(coffee_test). It even works if I deselect the outcome column, eg. coffee_recipe %>% bake(select(coffee_test, -cupper_points)). However, when I run the workflow through tune_grid I get the errors as shown. It looks like tune_grid can't find the variables that don't have the "predictor" role, even though bake does just fine.
Now, if I instead do things the normal way with a formula and step_rm the variables I don't care about, then things mostly work --- I get a few warnings for rows with missing country_of_origin values, which I find strange since I should be imputing those. It's entirely possible I've misunderstood the purpose of roles and how to use them.
library(tidyverse)
library(tidymodels)
#> ── Attaching packages ───────────────────────────────────────────────────── tidymodels 0.1.1 ──
#> ✓ broom 0.7.0 ✓ recipes 0.1.13
#> ✓ dials 0.0.8 ✓ rsample 0.0.7
#> ✓ infer 0.5.3 ✓ tune 0.1.1
#> ✓ modeldata 0.0.2 ✓ workflows 0.1.2
#> ✓ parsnip 0.1.2 ✓ yardstick 0.0.7
#> ── Conflicts ──────────────────────────────────────────────────────── tidymodels_conflicts() ──
#> x scales::discard() masks purrr::discard()
#> x dplyr::filter() masks stats::filter()
#> x recipes::fixed() masks stringr::fixed()
#> x dplyr::lag() masks stats::lag()
#> x yardstick::spec() masks readr::spec()
#> x recipes::step() masks stats::step()
set.seed(12345)
coffee <- tidytuesdayR::tt_load(2020, week = 28)$coffee_ratings
#> --- Compiling #TidyTuesday Information for 2020-07-07 ----
#> --- There is 1 file available ---
#> --- Starting Download ---
#>
#> Downloading file 1 of 1: `coffee_ratings.csv`
#> --- Download complete ---
colnames(coffee)
#> [1] "total_cup_points" "species" "owner"
#> [4] "country_of_origin" "farm_name" "lot_number"
#> [7] "mill" "ico_number" "company"
#> [10] "altitude" "region" "producer"
#> [13] "number_of_bags" "bag_weight" "in_country_partner"
#> [16] "harvest_year" "grading_date" "owner_1"
#> [19] "variety" "processing_method" "aroma"
#> [22] "flavor" "aftertaste" "acidity"
#> [25] "body" "balance" "uniformity"
#> [28] "clean_cup" "sweetness" "cupper_points"
#> [31] "moisture" "category_one_defects" "quakers"
#> [34] "color" "category_two_defects" "expiration"
#> [37] "certification_body" "certification_address" "certification_contact"
#> [40] "unit_of_measurement" "altitude_low_meters" "altitude_high_meters"
#> [43] "altitude_mean_meters"
coffee_split <- initial_split(coffee, prop = 0.8)
coffee_train <- training(coffee_split)
coffee_test <- testing(coffee_split)
coffee_recipe <- recipe(coffee_train) %>%
update_role(cupper_points, new_role = "outcome") %>%
update_role(
variety, processing_method, country_of_origin,
aroma, flavor, aftertaste, acidity, sweetness, altitude_mean_meters,
new_role = "predictor"
) %>%
step_string2factor(all_nominal(), -all_outcomes()) %>%
step_knnimpute(
country_of_origin, altitude_mean_meters,
impute_with = imp_vars(
in_country_partner, company, region, farm_name, certification_body
)
) %>%
step_unknown(variety, processing_method, new_level = "Unknown") %>%
step_other(country_of_origin, threshold = 0.01) %>%
step_other(processing_method, threshold = 0.10) %>%
step_other(variety, threshold = 0.10)
coffee_recipe
#> Data Recipe
#>
#> Inputs:
#>
#> role #variables
#> outcome 1
#> predictor 9
#>
#> 33 variables with undeclared roles
#>
#> Operations:
#>
#> Factor variables from all_nominal(), -all_outcomes()
#> K-nearest neighbor imputation for country_of_origin, altitude_mean_meters
#> Unknown factor level assignment for variety, processing_method
#> Collapsing factor levels for country_of_origin
#> Collapsing factor levels for processing_method
#> Collapsing factor levels for variety
# This works just fine
coffee_recipe %>%
prep(coffee_train) %>%
bake(select(coffee_test, -cupper_points)) %>%
head()
#> # A tibble: 6 x 42
#> total_cup_points species owner country_of_orig… farm_name lot_number mill
#> <dbl> <fct> <fct> <fct> <fct> <fct> <fct>
#> 1 90.6 Arabica meta… Ethiopia metad plc <NA> meta…
#> 2 87.9 Arabica cqi … other <NA> <NA> <NA>
#> 3 87.9 Arabica grou… United States (… <NA> <NA> <NA>
#> 4 87.3 Arabica ethi… Ethiopia <NA> <NA> <NA>
#> 5 87.2 Arabica cqi … other <NA> <NA> <NA>
#> 6 86.9 Arabica ethi… Ethiopia <NA> <NA> <NA>
#> # … with 35 more variables: ico_number <fct>, company <fct>, altitude <fct>,
#> # region <fct>, producer <fct>, number_of_bags <dbl>, bag_weight <fct>,
#> # in_country_partner <fct>, harvest_year <fct>, grading_date <fct>,
#> # owner_1 <fct>, variety <fct>, processing_method <fct>, aroma <dbl>,
#> # flavor <dbl>, aftertaste <dbl>, acidity <dbl>, body <dbl>, balance <dbl>,
#> # uniformity <dbl>, clean_cup <dbl>, sweetness <dbl>, moisture <dbl>,
#> # category_one_defects <dbl>, quakers <dbl>, color <fct>,
#> # category_two_defects <dbl>, expiration <fct>, certification_body <fct>,
#> # certification_address <fct>, certification_contact <fct>,
#> # unit_of_measurement <fct>, altitude_low_meters <dbl>,
#> # altitude_high_meters <dbl>, altitude_mean_meters <dbl>
# Now let's try putting it into a workflow and running tune_grid
coffee_model <- rand_forest(trees = 500, mtry = tune()) %>%
set_engine("ranger") %>%
set_mode("regression")
coffee_model
#> Random Forest Model Specification (regression)
#>
#> Main Arguments:
#> mtry = tune()
#> trees = 500
#>
#> Computational engine: ranger
coffee_workflow <- workflow() %>%
add_recipe(coffee_recipe) %>%
add_model(coffee_model)
coffee_workflow
#> ══ Workflow ═══════════════════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: rand_forest()
#>
#> ── Preprocessor ───────────────────────────────────────────────────────────────────────────────
#> 6 Recipe Steps
#>
#> ● step_string2factor()
#> ● step_knnimpute()
#> ● step_unknown()
#> ● step_other()
#> ● step_other()
#> ● step_other()
#>
#> ── Model ──────────────────────────────────────────────────────────────────────────────────────
#> Random Forest Model Specification (regression)
#>
#> Main Arguments:
#> mtry = tune()
#> trees = 500
#>
#> Computational engine: ranger
coffee_grid <- expand_grid(mtry = c(2, 5))
coffee_folds <- vfold_cv(coffee_train, v = 5)
coffee_workflow %>%
tune_grid(
resamples = coffee_folds,
grid = coffee_grid
)
#> x Fold1: model 1/2 (predictions): Error: Can't subset columns that don't exist.
#> x...
#> x Fold1: model 2/2 (predictions): Error: Can't subset columns that don't exist.
#> x...
#> x Fold2: model 1/2 (predictions): Error: Can't subset columns that don't exist.
#> x...
#> x Fold2: model 2/2 (predictions): Error: Can't subset columns that don't exist.
#> x...
#> x Fold3: model 1/2 (predictions): Error: Can't subset columns that don't exist.
#> x...
#> x Fold3: model 2/2 (predictions): Error: Can't subset columns that don't exist.
#> x...
#> x Fold4: model 1/2 (predictions): Error: Can't subset columns that don't exist.
#> x...
#> x Fold4: model 2/2 (predictions): Error: Can't subset columns that don't exist.
#> x...
#> x Fold5: model 1/2 (predictions): Error: Can't subset columns that don't exist.
#> x...
#> x Fold5: model 2/2 (predictions): Error: Can't subset columns that don't exist.
#> x...
#> Warning: All models failed in tune_grid(). See the `.notes` column.
#> Warning: This tuning result has notes. Example notes on model fitting include:
#> model 1/2 (predictions): Error: Can't subset columns that don't exist.
#> x Columns `species`, `owner`, `farm_name`, `lot_number`, `mill`, etc. don't exist.
#> model 1/2 (predictions): Error: Can't subset columns that don't exist.
#> x Columns `species`, `owner`, `farm_name`, `lot_number`, `mill`, etc. don't exist.
#> model 2/2 (predictions): Error: Can't subset columns that don't exist.
#> x Columns `species`, `owner`, `farm_name`, `lot_number`, `mill`, etc. don't exist.
#> # Tuning results
#> # 5-fold cross-validation
#> # A tibble: 5 x 4
#> splits id .metrics .notes
#> <list> <chr> <list> <list>
#> 1 <split [857/215]> Fold1 <NULL> <tibble [2 × 1]>
#> 2 <split [857/215]> Fold2 <NULL> <tibble [2 × 1]>
#> 3 <split [858/214]> Fold3 <NULL> <tibble [2 × 1]>
#> 4 <split [858/214]> Fold4 <NULL> <tibble [2 × 1]>
#> 5 <split [858/214]> Fold5 <NULL> <tibble [2 × 1]>
Created on 2020-07-21 by the reprex package (v0.3.0)
Session info
devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.0.0 (2020-04-24)
#> os Ubuntu 20.04 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language en_AU:en
#> collate en_AU.UTF-8
#> ctype en_AU.UTF-8
#> tz Australia/Melbourne
#> date 2020-07-21
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.0)
#> backports 1.1.8 2020-06-17 [1] CRAN (R 4.0.0)
#> blob 1.2.1 2020-01-20 [1] CRAN (R 4.0.0)
#> broom * 0.7.0 2020-07-09 [1] CRAN (R 4.0.0)
#> callr 3.4.3 2020-03-28 [1] CRAN (R 4.0.0)
#> cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.0.0)
#> class 7.3-17 2020-04-26 [4] CRAN (R 4.0.0)
#> cli 2.0.2 2020-02-28 [1] CRAN (R 4.0.0)
#> codetools 0.2-16 2018-12-24 [4] CRAN (R 4.0.0)
#> colorspace 1.4-1 2019-03-18 [1] CRAN (R 4.0.0)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 4.0.0)
#> curl 4.3 2019-12-02 [1] CRAN (R 4.0.0)
#> DBI 1.1.0 2019-12-15 [1] CRAN (R 4.0.0)
#> dbplyr 1.4.4 2020-05-27 [1] CRAN (R 4.0.0)
#> desc 1.2.0 2018-05-01 [1] CRAN (R 4.0.0)
#> devtools 2.3.0 2020-04-10 [1] CRAN (R 4.0.0)
#> dials * 0.0.8 2020-07-08 [1] CRAN (R 4.0.0)
#> DiceDesign 1.8-1 2019-07-31 [1] CRAN (R 4.0.0)
#> digest 0.6.25 2020-02-23 [1] CRAN (R 4.0.0)
#> dplyr * 1.0.0 2020-05-29 [1] CRAN (R 4.0.0)
#> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.0)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.0)
#> fansi 0.4.1 2020-01-08 [1] CRAN (R 4.0.0)
#> forcats * 0.5.0 2020-03-01 [1] CRAN (R 4.0.0)
#> foreach 1.5.0 2020-03-30 [1] CRAN (R 4.0.0)
#> fs 1.4.1 2020-04-04 [1] CRAN (R 4.0.0)
#> furrr 0.1.0 2018-05-16 [1] CRAN (R 4.0.0)
#> future 1.17.0 2020-04-18 [1] CRAN (R 4.0.0)
#> generics 0.0.2 2018-11-29 [1] CRAN (R 4.0.0)
#> ggplot2 * 3.3.2.9000 2020-07-10 [1] Github (tidyverse/ggplot2#a11e098)
#> globals 0.12.5 2019-12-07 [1] CRAN (R 4.0.0)
#> glue 1.4.1 2020-05-13 [1] CRAN (R 4.0.0)
#> gower 0.2.2 2020-06-23 [1] CRAN (R 4.0.0)
#> GPfit 1.0-8 2019-02-08 [1] CRAN (R 4.0.0)
#> gtable 0.3.0 2019-03-25 [1] CRAN (R 4.0.0)
#> hardhat 0.1.4 2020-07-02 [1] CRAN (R 4.0.0)
#> haven 2.2.0 2019-11-08 [1] CRAN (R 4.0.0)
#> highr 0.8 2019-03-20 [1] CRAN (R 4.0.0)
#> hms 0.5.3 2020-01-08 [1] CRAN (R 4.0.0)
#> htmltools 0.5.0 2020-06-16 [1] CRAN (R 4.0.0)
#> httr 1.4.1 2019-08-05 [1] CRAN (R 4.0.0)
#> infer * 0.5.3 2020-07-14 [1] CRAN (R 4.0.0)
#> ipred 0.9-9 2019-04-28 [1] CRAN (R 4.0.0)
#> iterators 1.0.12 2019-07-26 [1] CRAN (R 4.0.0)
#> jsonlite 1.7.0 2020-06-25 [1] CRAN (R 4.0.0)
#> knitr 1.29 2020-06-23 [1] CRAN (R 4.0.0)
#> lattice 0.20-41 2020-04-02 [4] CRAN (R 4.0.0)
#> lava 1.6.7 2020-03-05 [1] CRAN (R 4.0.0)
#> lhs 1.0.2 2020-04-13 [1] CRAN (R 4.0.0)
#> lifecycle 0.2.0 2020-03-06 [1] CRAN (R 4.0.0)
#> listenv 0.8.0 2019-12-05 [1] CRAN (R 4.0.0)
#> lubridate 1.7.8 2020-04-06 [1] CRAN (R 4.0.0)
#> magrittr 1.5 2014-11-22 [1] CRAN (R 4.0.0)
#> MASS 7.3-51.6 2020-04-26 [4] CRAN (R 4.0.0)
#> Matrix 1.2-18 2019-11-27 [4] CRAN (R 4.0.0)
#> memoise 1.1.0.9000 2020-05-09 [1] Github (hadley/memoise#4aefd9f)
#> modeldata * 0.0.2 2020-06-22 [1] CRAN (R 4.0.0)
#> modelr 0.1.6 2020-02-22 [1] CRAN (R 4.0.0)
#> munsell 0.5.0 2018-06-12 [1] CRAN (R 4.0.0)
#> nnet 7.3-14 2020-04-26 [4] CRAN (R 4.0.0)
#> parsnip * 0.1.2 2020-07-03 [1] CRAN (R 4.0.0)
#> pillar 1.4.6 2020-07-10 [1] CRAN (R 4.0.0)
#> pkgbuild 1.0.8 2020-05-07 [1] CRAN (R 4.0.0)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.0)
#> pkgload 1.1.0 2020-05-29 [1] CRAN (R 4.0.0)
#> plyr 1.8.6 2020-03-03 [1] CRAN (R 4.0.0)
#> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.0.0)
#> pROC 1.16.2 2020-03-19 [1] CRAN (R 4.0.0)
#> processx 3.4.3 2020-07-05 [1] CRAN (R 4.0.0)
#> prodlim 2019.11.13 2019-11-17 [1] CRAN (R 4.0.0)
#> ps 1.3.3 2020-05-08 [1] CRAN (R 4.0.0)
#> purrr * 0.3.4 2020-04-17 [1] CRAN (R 4.0.0)
#> R6 2.4.1 2019-11-12 [1] CRAN (R 4.0.0)
#> ranger 0.12.1 2020-01-10 [1] CRAN (R 4.0.0)
#> Rcpp 1.0.5 2020-07-06 [1] CRAN (R 4.0.0)
#> readr * 1.3.1 2018-12-21 [1] CRAN (R 4.0.0)
#> readxl 1.3.1 2019-03-13 [1] CRAN (R 4.0.0)
#> recipes * 0.1.13 2020-06-23 [1] CRAN (R 4.0.0)
#> remotes 2.1.1 2020-02-15 [1] CRAN (R 4.0.0)
#> reprex 0.3.0 2019-05-16 [1] CRAN (R 4.0.0)
#> rlang 0.4.7 2020-07-09 [1] CRAN (R 4.0.0)
#> rmarkdown 2.3.2 2020-07-12 [1] Github (rstudio/rmarkdown#ff1b279)
#> rpart 4.1-15 2019-04-12 [4] CRAN (R 4.0.0)
#> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 4.0.0)
#> rsample * 0.0.7 2020-06-04 [1] CRAN (R 4.0.0)
#> rstudioapi 0.11 2020-02-07 [1] CRAN (R 4.0.0)
#> rvest 0.3.5 2019-11-08 [1] CRAN (R 4.0.0)
#> scales * 1.1.1 2020-05-11 [1] CRAN (R 4.0.0)
#> selectr 0.4-2 2019-11-20 [1] CRAN (R 4.0.0)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.0)
#> stringi 1.4.6 2020-02-17 [1] CRAN (R 4.0.0)
#> stringr * 1.4.0 2019-02-10 [1] CRAN (R 4.0.0)
#> survival 3.1-12 2020-04-10 [4] CRAN (R 4.0.0)
#> testthat 2.3.2 2020-03-02 [1] CRAN (R 4.0.0)
#> tibble * 3.0.3 2020-07-10 [1] CRAN (R 4.0.0)
#> tidymodels * 0.1.1 2020-07-14 [1] CRAN (R 4.0.0)
#> tidyr * 1.1.0 2020-05-20 [1] CRAN (R 4.0.0)
#> tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.0)
#> tidytuesdayR 1.0.1 2020-07-10 [1] CRAN (R 4.0.0)
#> tidyverse * 1.3.0 2019-11-21 [1] CRAN (R 4.0.0)
#> timeDate 3043.102 2018-02-21 [1] CRAN (R 4.0.0)
#> tune * 0.1.1 2020-07-08 [1] CRAN (R 4.0.0)
#> usethis 1.6.1 2020-04-29 [1] CRAN (R 4.0.0)
#> utf8 1.1.4 2018-05-24 [1] CRAN (R 4.0.0)
#> vctrs 0.3.2 2020-07-15 [1] CRAN (R 4.0.0)
#> withr 2.2.0 2020-04-20 [1] CRAN (R 4.0.0)
#> workflows * 0.1.2 2020-07-07 [1] CRAN (R 4.0.0)
#> xfun 0.15 2020-06-21 [1] CRAN (R 4.0.0)
#> xml2 1.3.2 2020-04-23 [1] CRAN (R 4.0.0)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.0)
#> yardstick * 0.0.7 2020-07-13 [1] CRAN (R 4.0.0)
#>
#> [1] /home/mdneuzerling/R/x86_64-pc-linux-gnu-library/4.0
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library
The error here occurs because on step_string2factor() during tuning, the recipe starts trying to handle variables that don't have any roles, like species and owner.
Try setting the role for all of your nominal variables before picking out the outcomes and predictors.
coffee_recipe <- recipe(coffee_train) %>%
update_role(all_nominal(), new_role = "id") %>% ## ADD THIS
update_role(cupper_points, new_role = "outcome") %>%
update_role(
variety, processing_method, country_of_origin,
aroma, flavor, aftertaste, acidity, sweetness, altitude_mean_meters,
new_role = "predictor"
) %>%
step_string2factor(all_nominal(), -all_outcomes()) %>%
step_knnimpute(
country_of_origin, altitude_mean_meters,
impute_with = imp_vars(
in_country_partner, company, region, farm_name, certification_body
)
) %>%
step_unknown(variety, processing_method, new_level = "Unknown") %>%
step_other(country_of_origin, threshold = 0.01) %>%
step_other(processing_method, threshold = 0.10) %>%
step_other(variety, threshold = 0.10)
After I do this, this mostly runs fine, with only some failures to impute altitude. It might be tough to impute both of those things at the same time.

Why does dplyr::mutate_at() on the first element in a rowwise-tibble also take effect on the rest of the elements?

In the following code, I defined a tibble df with two columns: name column contains a character vector of c("a", "b", "c"), and data column contains a list of tibbles, each with the column value. Then I'd like to change the column name of each tibble's value column to the character in the corresponding row, e.g. "a", "b" and "c". To manipulate the tibble in a row-wise manner, I used dplyr::rowwise(), but then I found that the changes taking effect on the first element (changing the column name to "a") also took effect on the rest of the elements (since after the first row, the printed tibble before the change of the column name showed the column name of "a"). And therefore, it can be expected that the change of column names to the following elements in the column failed, since there were no longer column names of "value" (all changed to "a"). Do I have to use a purrr::map() function here instead of the tidier row-wise tibble manipulation?
Would you please give me an answer using rowwise-mutate_at method? Thanks.
library(tidyverse)
#> Warning: 程辑包'tidyverse'是用R版本3.6.3 来建造的
#> Warning: 程辑包'ggplot2'是用R版本3.6.1 来建造的
#> Warning: 程辑包'tibble'是用R版本3.6.3 来建造的
#> Warning: 程辑包'tidyr'是用R版本3.6.1 来建造的
#> Warning: 程辑包'readr'是用R版本3.6.1 来建造的
#> Warning: 程辑包'purrr'是用R版本3.6.1 来建造的
#> Warning: 程辑包'dplyr'是用R版本3.6.3 来建造的
#> Warning: 程辑包'stringr'是用R版本3.6.1 来建造的
#> Warning: 程辑包'forcats'是用R版本3.6.3 来建造的
df <- tibble::tibble(name = c("a", "b", "c"),
data = list(tibble::tibble(value = 1:10)))
df_mutate <- df %>%
dplyr::rowwise() %>%
dplyr::mutate_at("data", ~ {
print(.x)
colnames(.x)[colnames(.x) %in% "value"] <- name
list(.x)
}) %>%
dplyr::ungroup()
#> # A tibble: 10 x 1
#> value
#> <int>
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5
#> 6 6
#> 7 7
#> 8 8
#> 9 9
#> 10 10
#> # A tibble: 10 x 1
#> a
#> <int>
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5
#> 6 6
#> 7 7
#> 8 8
#> 9 9
#> 10 10
#> # A tibble: 10 x 1
#> a
#> <int>
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5
#> 6 6
#> 7 7
#> 8 8
#> 9 9
#> 10 10
Created on 2020-06-19 by the reprex package (v0.3.0)
devtools::session_info()
#> - Session info ---------------------------------------------------------------
#> setting value
#> version R version 3.6.0 (2019-04-26)
#> os Windows Server x64
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate Chinese (Simplified)_China.936
#> ctype Chinese (Simplified)_China.936
#> tz Asia/Taipei
#> date 2020-06-19
#>
#> - Packages -------------------------------------------------------------------
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.1)
#> backports 1.1.5 2019-10-02 [1] CRAN (R 3.6.1)
#> broom 0.5.6 2020-04-20 [1] CRAN (R 3.6.3)
#> callr 3.4.0 2019-12-09 [1] CRAN (R 3.6.2)
#> cellranger 1.1.0 2016-07-27 [1] CRAN (R 3.6.1)
#> cli 2.0.2 2020-02-28 [1] CRAN (R 3.6.3)
#> colorspace 1.4-1 2019-03-18 [1] CRAN (R 3.6.1)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.1)
#> DBI 1.1.0 2019-12-15 [1] CRAN (R 3.6.2)
#> dbplyr 1.4.2 2019-06-17 [1] CRAN (R 3.6.3)
#> desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.1)
#> devtools 2.2.1 2019-09-24 [1] CRAN (R 3.6.1)
#> digest 0.6.23 2019-11-23 [1] CRAN (R 3.6.2)
#> dplyr * 1.0.0 2020-05-29 [1] CRAN (R 3.6.3)
#> ellipsis 0.3.0 2019-09-20 [1] CRAN (R 3.6.1)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.1)
#> fansi 0.4.0 2018-10-05 [1] CRAN (R 3.6.1)
#> forcats * 0.5.0 2020-03-01 [1] CRAN (R 3.6.3)
#> fs 1.3.1 2019-05-06 [1] CRAN (R 3.6.1)
#> generics 0.0.2 2018-11-29 [1] CRAN (R 3.6.1)
#> ggplot2 * 3.2.1 2019-08-10 [1] CRAN (R 3.6.1)
#> glue 1.4.1 2020-05-13 [1] CRAN (R 3.6.3)
#> gtable 0.3.0 2019-03-25 [1] CRAN (R 3.6.1)
#> haven 2.2.0 2019-11-08 [1] CRAN (R 3.6.3)
#> highr 0.8 2019-03-20 [1] CRAN (R 3.6.1)
#> hms 0.5.2 2019-10-30 [1] CRAN (R 3.6.2)
#> htmltools 0.4.0 2019-10-04 [1] CRAN (R 3.6.1)
#> httr 1.4.1 2019-08-05 [1] CRAN (R 3.6.1)
#> jsonlite 1.6 2018-12-07 [1] CRAN (R 3.6.1)
#> knitr 1.26 2019-11-12 [1] CRAN (R 3.6.2)
#> lattice 0.20-38 2018-11-04 [2] CRAN (R 3.6.0)
#> lazyeval 0.2.2 2019-03-15 [1] CRAN (R 3.6.1)
#> lifecycle 0.2.0 2020-03-06 [1] CRAN (R 3.6.3)
#> lubridate 1.7.4 2018-04-11 [1] CRAN (R 3.6.2)
#> magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.1)
#> memoise 1.1.0 2017-04-21 [1] CRAN (R 3.6.1)
#> modelr 0.1.6 2020-02-22 [1] CRAN (R 3.6.3)
#> munsell 0.5.0 2018-06-12 [1] CRAN (R 3.6.1)
#> nlme 3.1-143 2019-12-10 [1] CRAN (R 3.6.2)
#> pillar 1.4.3 2019-12-20 [1] CRAN (R 3.6.2)
#> pkgbuild 1.0.6 2019-10-09 [1] CRAN (R 3.6.0)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 3.6.0)
#> pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.6.1)
#> prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.6.1)
#> processx 3.4.1 2019-07-18 [1] CRAN (R 3.6.1)
#> ps 1.3.0 2018-12-21 [1] CRAN (R 3.6.1)
#> purrr * 0.3.3 2019-10-18 [1] CRAN (R 3.6.1)
#> R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.2)
#> Rcpp 1.0.3 2019-11-08 [1] CRAN (R 3.6.2)
#> readr * 1.3.1 2018-12-21 [1] CRAN (R 3.6.1)
#> readxl 1.3.1 2019-03-13 [1] CRAN (R 3.6.1)
#> remotes 2.1.0 2019-06-24 [1] CRAN (R 3.6.1)
#> reprex 0.3.0 2019-05-16 [1] CRAN (R 3.6.3)
#> rlang 0.4.6 2020-05-02 [1] CRAN (R 3.6.3)
#> rmarkdown 2.0 2019-12-12 [1] CRAN (R 3.6.2)
#> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.6.1)
#> rvest 0.3.5 2019-11-08 [1] CRAN (R 3.6.3)
#> scales 1.1.0 2019-11-18 [1] CRAN (R 3.6.2)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.1)
#> stringi 1.4.3 2019-03-12 [1] CRAN (R 3.6.0)
#> stringr * 1.4.0 2019-02-10 [1] CRAN (R 3.6.1)
#> testthat 2.3.1 2019-12-01 [1] CRAN (R 3.6.2)
#> tibble * 3.0.1 2020-04-20 [1] CRAN (R 3.6.3)
#> tidyr * 1.0.0 2019-09-11 [1] CRAN (R 3.6.1)
#> tidyselect 1.1.0 2020-05-11 [1] CRAN (R 3.6.3)
#> tidyverse * 1.3.0 2019-11-21 [1] CRAN (R 3.6.3)
#> usethis 1.5.1 2019-07-04 [1] CRAN (R 3.6.1)
#> utf8 1.1.4 2018-05-24 [1] CRAN (R 3.6.1)
#> vctrs 0.3.0 2020-05-11 [1] CRAN (R 3.6.3)
#> withr 2.1.2 2018-03-15 [1] CRAN (R 3.6.1)
#> xfun 0.11 2019-11-12 [1] CRAN (R 3.6.2)
#> xml2 1.2.2 2019-08-09 [1] CRAN (R 3.6.1)
#> yaml 2.2.0 2018-07-25 [1] CRAN (R 3.6.0)
#>
#> [1] C:/Users/xzhu/Documents/R/win-library/3.6
#> [2] C:/Program Files/R/R-3.6.0/library
Yes, you can use map2 :
library(dplyr)
df %>% mutate(data = purrr::map2(name, data, ~{names(.y) <- .x;.y}))
Or Map in base R :
df$data <- Map(function(x, y) {names(y) <- x;y}, df$name, df$data)
If you want to use rowwise a similar approach would be :
df %>% rowwise() %>% mutate(data = {names(data) <- name;list(data)})

The output of the qwraps2 code is not as expected

I am trying to get the summary table output for one of the datasets but the output is not in the form of a tidy table
options(qwraps2_markup = "markdown")
age_summary <- list("Age" =
list("Min" = ~min(.data$Age),
"Max" = ~max(.data$Age),
"Mean" = ~mean_sd(.data$Age)))
age_tab <- summary_table(insurance, age_summary)
age_tab
When I knit the RMarkdown file, the summary table is similar to the one that comes as an output in the Console and not the expected formatted summary table.
The object generated by qwraps2::summary_table is a character matrix with
the class attribute qwraps2_summary_table. The
qwraps2:::print.qwraps2_summary_table and qwraps2:::print.qable methods
are responsible for the way the table is presented in the output. Chunk
options will be responsible for how the table is rendered in the output
document.
Update: as of qwraps2 version 0.5.0 the use of the .data is no longer
needed or recommended.
options(qwraps2_markup = "markdown")
library(qwraps2)
eg_data <- data.frame(Age = rnorm(1000, mean = 54, sd = 10))
age_summary <- list("Age" =
list(
"Min" = ~ min(Age),
"Max" = ~ max(Age),
"Mean" = ~ mean_sd(Age)
)
)
age_table <- summary_table(eg_data, age_summary)
Take a look at the structure of age_table
str(age_table)
#> 'qwraps2_summary_table' chr [1:3, 1] "25.1137149669314" "83.5664804448924" ...
#> - attr(*, "dimnames")=List of 2
#> ..$ : chr [1:3] "Min" "Max" "Mean"
#> ..$ : chr "eg_data (N = 1,000)"
#> - attr(*, "rgroups")= Named int 3
#> ..- attr(*, "names")= chr "Age"
#> - attr(*, "n")= int 1000
As noted above, the object is a 3 x 1 character matrix of class
qwraps2_summary_table. To see the return in the R console:
print.default(age_table)
#> eg_data (N = 1,000)
#> Min "25.1137149669314"
#> Max "83.5664804448924"
#> Mean "53.75 ± 9.94"
#> attr(,"rgroups")
#> Age
#> 3
#> attr(,"n")
#> [1] 1000
#> attr(,"class")
#> [1] "qwraps2_summary_table" "matrix" "array"
Since the options(qwraps2_markup = "markdown") as been set, the printing
method will return a markdown table
age_table
#>
#>
#> | |eg_data (N = 1,000) |
#> |:-----------------|:-------------------|
#> |**Age** | |
#> | Min |25.1137149669314 |
#> | Max |83.5664804448924 |
#> | Mean |53.75 ± 9.94 |
Make sure you have the results = "asis" chunk option set in your .Rmd file
so that the table will render correctly in your output document.
Created on 2020-09-14 by the reprex package (v0.3.0)
devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.0.2 (2020-06-22)
#> os macOS Catalina 10.15.6
#> system x86_64, darwin17.0
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz America/Denver
#> date 2020-09-14
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.0)
#> backports 1.1.9 2020-08-24 [1] CRAN (R 4.0.2)
#> callr 3.4.4 2020-09-07 [1] CRAN (R 4.0.2)
#> cli 2.0.2 2020-02-28 [1] CRAN (R 4.0.0)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 4.0.0)
#> desc 1.2.0 2018-05-01 [1] CRAN (R 4.0.0)
#> devtools 2.3.1 2020-07-21 [1] CRAN (R 4.0.2)
#> digest 0.6.25 2020-02-23 [1] CRAN (R 4.0.0)
#> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.0)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.0)
#> fansi 0.4.1 2020-01-08 [1] CRAN (R 4.0.0)
#> fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2)
#> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
#> highr 0.8 2019-03-20 [1] CRAN (R 4.0.0)
#> htmltools 0.5.0 2020-06-16 [1] CRAN (R 4.0.0)
#> knitr 1.29 2020-06-23 [1] CRAN (R 4.0.0)
#> magrittr 1.5 2014-11-22 [1] CRAN (R 4.0.0)
#> memoise 1.1.0 2017-04-21 [1] CRAN (R 4.0.0)
#> pkgbuild 1.1.0 2020-07-13 [1] CRAN (R 4.0.2)
#> pkgload 1.1.0 2020-05-29 [1] CRAN (R 4.0.0)
#> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.0.0)
#> processx 3.4.4 2020-09-03 [1] CRAN (R 4.0.2)
#> ps 1.3.4 2020-08-11 [1] CRAN (R 4.0.2)
#> qwraps2 * 0.5.0 2020-09-14 [1] local
#> R6 2.4.1 2019-11-12 [1] CRAN (R 4.0.0)
#> Rcpp 1.0.5 2020-07-06 [1] CRAN (R 4.0.0)
#> remotes 2.2.0 2020-07-21 [1] CRAN (R 4.0.2)
#> rlang 0.4.7 2020-07-09 [1] CRAN (R 4.0.2)
#> rmarkdown 2.3 2020-06-18 [1] CRAN (R 4.0.0)
#> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 4.0.0)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.0)
#> stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.0)
#> testthat 2.3.2 2020-03-02 [1] CRAN (R 4.0.0)
#> usethis 1.6.1 2020-04-29 [1] CRAN (R 4.0.0)
#> withr 2.2.0 2020-04-20 [1] CRAN (R 4.0.0)
#> xfun 0.17 2020-09-09 [1] CRAN (R 4.0.2)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.0)
#>
#> [1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library

Resources