ddply summarise with confidence interval?

I am trying to summarise my data with ddply, and I am looking for a way to summarise it while reflecting its reliability.
Here is a description of my data set.
BSTN ASTN BSEC ASTN1 BSTN2 ASTN2 BSTN3 ASTN3 BSTN4 ASTN4 BSTN5 TFtime Ttime ID
1 1001 1003 69551 1703 1703 0 0 0 0 0 0 399 2933 35404
2 1001 1006 69664 1703 1703 0 0 0 0 0 0 399 2284 35405
3 1001 1701 66606 1703 1703 0 0 0 0 0 0 118 1750 35406
4 1001 1701 66600 1703 1703 0 0 0 0 0 0 118 1750 35406
5 1001 1701 66601 1703 1703 0 0 0 0 0 0 118 1750 35406
6 1001 1703 69434 0 0 0 0 0 0 0 0 0 1005 35407
What I would like as output is a summary of the values of Ttime and TFtime grouped by the "ASTN" and "BSTN" columns.
For the mean values of "Ttime" and "TFtime" I would like to reflect the 95% confidence interval, i.e. calculate the mean values of "Ttime" and "TFtime" within the 95% boundary. How would I do this with ddply when there are multiple combinations of BSTN~ASTNs?
Below is the code I used and wish to revise.
Routetable <- ddply(A, c(.(BSTN), .(ASTN1), .(BSTN2), .(ASTN2), .(BSTN3), .(ASTN3), .(BSTN4), .(ASTN4), .(BSTN5), .(ASTN)),
                    summarise, count = length(BSTN), mean = mean(Ttime), TFtimemean = mean(TFtime))

updated answer
I'm not sure, but I guess what you actually want to do is filter out all values that are larger/smaller than mean(x) +/- 2*sd(x), and this by each group. The following approach does that. In the case of ggplot2's diamonds data set it keeps about 97% of all values and just removes the extremes.
library(tidyverse)
diamonds %>%
  group_by(cut, color) %>%
  mutate(across(c(x, y, z),
                list(low  = ~ mean(.x, na.rm = TRUE) - 2 * sd(.x, na.rm = TRUE),
                     high = ~ mean(.x, na.rm = TRUE) + 2 * sd(.x, na.rm = TRUE))
                )
         ) %>%
  filter(x >= x_low & x <= x_high,
         y >= y_low & y <= y_high,
         z >= z_low & z <= z_high)
#> # A tibble: 52,299 x 16
#> # Groups: cut, color [35]
#> carat cut color clarity depth table price x y z x_low x_high
#> <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 3.51 6.92
#> 2 0.21 Prem~ E SI1 59.8 61 326 3.89 3.84 2.31 3.52 7.65
#> 3 0.290 Prem~ I VS2 62.4 58 334 4.2 4.23 2.63 3.86 9.12
#> 4 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 4.14 8.62
#> 5 0.24 Very~ I VVS1 62.3 57 336 3.95 3.98 2.47 3.92 8.62
#> 6 0.26 Very~ H SI1 61.9 55 337 4.07 4.11 2.53 3.66 8.30
#> 7 0.23 Very~ H VS1 59.4 61 338 4 4.05 2.39 3.66 8.30
#> 8 0.3 Good J SI1 64 55 339 4.25 4.28 2.73 4.14 8.62
#> 9 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46 3.88 8.76
#> 10 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71 3.88 8.76
#> # ... with 52,289 more rows, and 4 more variables: y_low <dbl>, y_high <dbl>,
#> # z_low <dbl>, z_high <dbl>
Created on 2020-06-23 by the reprex package (v0.3.0)
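The mean +/- 2*sd rule above can be sanity-checked on simulated data; this is a minimal base-R sketch (the numbers are illustrative, and the ~95% retention only holds for roughly normal data):

```r
set.seed(1)
x <- rnorm(1000)

# keep only values within mean(x) +/- 2 * sd(x)
keep <- x >= mean(x) - 2 * sd(x) & x <= mean(x) + 2 * sd(x)

# for normal data roughly 95% of values survive the filter
mean(keep)
```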
old answer
With better example data we could take a more programmatic approach. As an example I use ggplot2's diamonds data set. See my comments in the code below.
library(tidyverse)
diamonds %>%
  # set up your groups
  nest_by(cut, color) %>%
  # define in `across` for which variables you want means and conf int to be calculated
  mutate(ttest = list(summarise(data, across(c(x, y, z),
                                             ~ broom::tidy(t.test(.x))))),
         ttest = list(unpack(ttest, c(x, y, z), names_sep = "_") %>%
                        # select only the estimates and conf intervals
                        select(contains("estimate"), contains("conf")))) %>%
  unnest(ttest)
#> # A tibble: 35 x 12
#> # Groups: cut, color [35]
#> cut color data x_estimate y_estimate z_estimate x_conf.low x_conf.high
#> <ord> <ord> <list<tb> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Fair D [163 × 8] 6.02 5.96 3.84 5.89 6.15
#> 2 Fair E [224 × 8] 5.91 5.86 3.72 5.80 6.02
#> 3 Fair F [312 × 8] 5.99 5.93 3.79 5.89 6.09
#> 4 Fair G [314 × 8] 6.17 6.11 3.96 6.06 6.28
#> 5 Fair H [303 × 8] 6.58 6.50 4.22 6.47 6.69
#> 6 Fair I [175 × 8] 6.56 6.49 4.19 6.43 6.70
#> 7 Fair J [119 × 8] 6.75 6.68 4.32 6.55 6.95
#> 8 Good D [662 × 8] 5.62 5.63 3.50 5.55 5.69
#> 9 Good E [933 × 8] 5.62 5.63 3.50 5.56 5.68
#> 10 Good F [909 × 8] 5.69 5.71 3.54 5.63 5.76
#> # … with 25 more rows, and 4 more variables: y_conf.low <dbl>,
#> # y_conf.high <dbl>, z_conf.low <dbl>, z_conf.high <dbl>
Created on 2020-06-19 by the reprex package (v0.3.0)
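The broom::tidy(t.test(.x)) step above just extracts the mean and its 95% confidence interval from a one-sample t-test; in base R alone that looks like this (using the TFtime column from the question as example data):

```r
x <- c(399, 399, 118, 118, 118, 0)  # TFtime values from the question's sample data

tt <- t.test(x)   # one-sample t-test, 95% CI by default
tt$estimate       # the mean
tt$conf.int       # lower and upper confidence bounds
```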
If you want to filter observations based on the confidence intervals of the means you can adjust my approach above as follows. Note that this is not the same as filtering the top and bottom 2.5 % of each variable; you will lose a lot of data.
library(tidyverse)
diamonds %>%
  nest_by(cut, color) %>%
  mutate(ttest = summarise(data, across(c(x, y, z),
                                        ~ broom::tidy(t.test(.x)))) %>%
           unpack(c(x, y, z), names_sep = "_")) %>%
  unpack(ttest) %>%
  select(cut, color, data, contains("estimate"), contains("conf")) %>%
  rowwise(cut, color) %>%
  mutate(data = list(filter(data,
                            x >= x_conf.low & x <= x_conf.high,
                            y >= y_conf.low & y <= y_conf.high,
                            z >= z_conf.low & z <= z_conf.high))) %>%
  unnest(data)
#> # A tibble: 322 x 19
#> # Groups: cut, color [30]
#> cut color carat clarity depth table price x y z x_estimate
#> <ord> <ord> <dbl> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 Fair D 0.91 SI2 62.5 66 3079 6.08 6.01 3.78 6.02
#> 2 Fair D 0.9 SI2 65.7 60 3205 5.98 5.93 3.91 6.02
#> 3 Fair D 0.9 SI2 64.7 59 3205 6.09 5.99 3.91 6.02
#> 4 Fair D 0.95 SI2 64.4 60 3384 6.06 6.02 3.89 6.02
#> 5 Fair D 0.9 SI2 64.9 57 3473 6.03 5.98 3.9 6.02
#> 6 Fair D 0.9 SI2 64.5 61 3473 6.1 6 3.9 6.02
#> 7 Fair D 0.9 SI1 64.5 61 3689 6.05 6.01 3.89 6.02
#> 8 Fair D 0.91 SI1 64.7 61 3730 6.06 5.99 3.9 6.02
#> 9 Fair D 0.9 SI2 64.6 59 3847 6.04 6.01 3.89 6.02
#> 10 Fair D 0.91 SI1 64.4 60 3855 6.08 6.04 3.9 6.02
#> # ... with 312 more rows, and 8 more variables: y_estimate <dbl>,
#> # z_estimate <dbl>, x_conf.low <dbl>, x_conf.high <dbl>, y_conf.low <dbl>,
#> # y_conf.high <dbl>, z_conf.low <dbl>, z_conf.high <dbl>
Created on 2020-06-22 by the reprex package (v0.3.0)

Using package dplyr (which is more up-to-date than plyr) you can write as follows. "LB" and "UB" stand for "Lower Bound" and "Upper Bound" respectively.
library(dplyr)
A %>%
  group_by(across(starts_with("BSTN") | starts_with("ASTN"))) %>%
  summarise(
    count = n(),
    mean_Ttime = mean(Ttime),
    mean_TFtime = mean(TFtime),
    LB_Ttime = mean_Ttime - qnorm(0.975) * sd(Ttime) / sqrt(count),
    UB_Ttime = mean_Ttime + qnorm(0.975) * sd(Ttime) / sqrt(count),
    LB_TFtime = mean_TFtime - qnorm(0.975) * sd(TFtime) / sqrt(count),
    UB_TFtime = mean_TFtime + qnorm(0.975) * sd(TFtime) / sqrt(count)
  )
Output
# A tibble: 4 x 17
# Groups: BSTN, BSTN2, BSTN3, BSTN4, BSTN5, ASTN, ASTN1, ASTN2, ASTN3 [4]
# BSTN BSTN2 BSTN3 BSTN4 BSTN5 ASTN ASTN1 ASTN2 ASTN3 ASTN4 count mean_Ttime mean_TFtime LB_Ttime UB_Ttime LB_TFtime UB_TFtime
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1001 0 0 0 0 1703 0 0 0 0 1 1005 0 NA NA NA NA
# 2 1001 1703 0 0 0 1003 1703 0 0 0 1 2933 399 NA NA NA NA
# 3 1001 1703 0 0 0 1006 1703 0 0 0 1 2284 399 NA NA NA NA
# 4 1001 1703 0 0 0 1701 1703 0 0 0 3 1750 118 1750 1750 118 118
With this sample data we obtain several NAs because the group count is 1 in those cases, but with larger data sets you will rarely see them.
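The bound formula can be checked by hand on the last group of the output (count = 3, all Ttime values equal to 1750): since sd is 0, the interval collapses to the mean itself. A base-R check:

```r
ttime <- c(1750, 1750, 1750)  # the BSTN 1001 / ASTN 1701 group from the sample data

n  <- length(ttime)
m  <- mean(ttime)
lb <- m - qnorm(0.975) * sd(ttime) / sqrt(n)
ub <- m + qnorm(0.975) * sd(ttime) / sqrt(n)

c(LB = lb, mean = m, UB = ub)  # all 1750, matching the last row of the output
```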

Related

How do I make the dataset displayed by the head() function show all the columns clearly

I used the head() function to display the first six rows of my dataset, but it doesn't show all the column names clearly. This is the way it appears:
head(activity)
# A tibble: 6 × 15
Id Activ…¹ Total…² Total…³ Track…⁴ Logge…⁵ VeryA…⁶ Moder…⁷ Light…⁸ Seden…⁹
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1503960366 4/12/2… 13162 8.5 8.5 0 1.88 0.550 6.06 0
2 1503960366 4/13/2… 10735 6.97 6.97 0 1.57 0.690 4.71 0
3 1503960366 4/14/2… 10460 6.74 6.74 0 2.44 0.400 3.91 0
4 1503960366 4/15/2… 9762 6.28 6.28 0 2.14 1.26 2.83 0
5 1503960366 4/16/2… 12669 8.16 8.16 0 2.71 0.410 5.04 0
6 1503960366 4/17/2… 9705 6.48 6.48 0 3.19 0.780 2.51 0
I saw someone's work on Kaggle, and with her dataset the head() function appeared this way:
A data.frame: 6 × 15
Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
<dbl> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int> <int> <int> <int>
1 1503960366 4/12/2016 13162 8.50 8.50 0 1.88 0.55 6.06 0 25 13 328 728 1985
2 1503960366 4/13/2016 10735 6.97 6.97 0 1.57 0.69 4.71 0 21 19 217 776 1797
3 1503960366 4/14/2016 10460 6.74 6.74 0 2.44 0.40 3.91 0 30 11 181 1218 1776
4 1503960366 4/15/2016 9762 6.28 6.28 0 2.14 1.26 2.83 0 29 34 209 726 1745
5 1503960366 4/16/2016 12669 8.16 8.16 0 2.71 0.41 5.04 0 36 10 221 773 1863
6 1503960366 4/17/2016 9705 6.48 6.48 0
Please, how do I make mine show all the columns clearly?
You are displaying a tibble(), which only shows the first ten rows and all the columns that fit on one screen. You can control the default appearance with options.
Have you tried options(pillar.width = Inf)?
See ?pillar::pillar_options and ?tibble_options for the available options.
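A minimal sketch of what that option does (Inf means "never truncate the printed columns to the terminal width"):

```r
# widen tibble printing so columns are no longer cut off at the screen width
options(pillar.width = Inf)

getOption("pillar.width")
#> [1] Inf
```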

Window function on grouped dataframe get difference between two rows after a filter

A grouped data frame:
grp_diamonds <- diamonds %>%
  group_by(cut, color) %>%
  mutate(rn = row_number()) %>%
  arrange(cut, color, rn) %>%
  mutate(cumprice = cumsum(price))
Looks like:
grp_diamonds
# A tibble: 53,940 × 12
# Groups: cut, color [35]
carat cut color clarity depth table price x y z rn cumprice
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int> <int>
1 0.75 Fair D SI2 64.6 57 2848 5.74 5.72 3.7 1 2848
2 0.71 Fair D VS2 56.9 65 2858 5.89 5.84 3.34 2 5706
3 0.9 Fair D SI2 66.9 57 2885 6.02 5.9 3.99 3 8591
4 1 Fair D SI2 69.3 58 2974 5.96 5.87 4.1 4 11565
5 1.01 Fair D SI2 64.6 56 3003 6.31 6.24 4.05 5 14568
6 0.73 Fair D VS1 66 54 3047 5.56 5.66 3.7 6 17615
7 0.71 Fair D VS2 64.7 58 3077 5.61 5.58 3.62 7 20692
8 0.91 Fair D SI2 62.5 66 3079 6.08 6.01 3.78 8 23771
9 0.9 Fair D SI2 65.9 59 3205 6 5.95 3.94 9 26976
10 0.9 Fair D SI2 66 58 3205 6 5.97 3.95 10 30181
Within each group, I would like to add a new field 'GROWTH_6_7' which is the delta between cumprice at rn = 7 and rn = 6.
I read the documentation and tried and failed using cur_data() with mutate. Maybe that's the right path, or maybe there's a 'better' way?
How can I mutate a new field 'GROWTH_6_7' within each group that is the difference between cumprice at rn == 7 and rn == 6?
We could do this within mutate itself:
library(dplyr)
grp_diamonds %>%
  group_by(cut, color) %>%
  mutate(GROWTH_6_7 = cumprice[rn == 7] - cumprice[rn == 6])
Output
# A tibble: 53,940 x 13
# Groups: cut, color [35]
carat cut color clarity depth table price x y z rn cumprice GROWTH_6_7
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int>
1 0.75 Fair D SI2 64.6 57 2848 5.74 5.72 3.7 1 2848 3077
2 0.71 Fair D VS2 56.9 65 2858 5.89 5.84 3.34 2 5706 3077
3 0.9 Fair D SI2 66.9 57 2885 6.02 5.9 3.99 3 8591 3077
4 1 Fair D SI2 69.3 58 2974 5.96 5.87 4.1 4 11565 3077
5 1.01 Fair D SI2 64.6 56 3003 6.31 6.24 4.05 5 14568 3077
6 0.73 Fair D VS1 66 54 3047 5.56 5.66 3.7 6 17615 3077
7 0.71 Fair D VS2 64.7 58 3077 5.61 5.58 3.62 7 20692 3077
8 0.91 Fair D SI2 62.5 66 3079 6.08 6.01 3.78 8 23771 3077
9 0.9 Fair D SI2 65.9 59 3205 6 5.95 3.94 9 26976 3077
10 0.9 Fair D SI2 66 58 3205 6 5.97 3.95 10 30181 3077
# … with 53,930 more rows
If there are cases where there are some missing values, then another option is pivot_wider
library(tidyr)
grp_diamonds %>%
  ungroup %>%
  select(cut, color, rn, cumprice) %>%
  filter(rn %in% 6:7) %>%
  pivot_wider(names_from = rn, values_from = cumprice) %>%
  transmute(cut, color, GROWTH_6_7 = `7` - `6`) %>%
  left_join(grp_diamonds, .)
You can also wrap the logic in a small helper function and call it with cur_data():
GrowDelta <- function(data, start_row = 6, end_row = 7) {
  data$cumprice[end_row] - data$cumprice[start_row]
}

grp_diamonds %>%
  summarize(GROWTH_6_7 = GrowDelta(cur_data()))
mutate instead of summarize should work too. It will just repeat the value for every row in the group instead of once per group, i.e. it will result in a tibble with the same number of rows as the data set, whereas summarize gives you a 35 x 3 tibble.
You may try group_modify:
Code
grow <- grp_diamonds %>%
  group_by(cut, color) %>%
  group_modify(~ {
    .x %>%
      mutate(GROWTH_6_7 = .x$cumprice[.x$rn == 7] - .x$cumprice[.x$rn == 6])
  })
Output
> head(grow)
# A tibble: 6 x 13
# Groups: cut, color [1]
cut color carat clarity depth table price x y z rn cumprice GROWTH_6_7
<ord> <ord> <dbl> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int>
1 Fair D 0.75 SI2 64.6 57 2848 5.74 5.72 3.7 1 2848 3077
2 Fair D 0.71 VS2 56.9 65 2858 5.89 5.84 3.34 2 5706 3077
3 Fair D 0.9 SI2 66.9 57 2885 6.02 5.9 3.99 3 8591 3077
4 Fair D 1 SI2 69.3 58 2974 5.96 5.87 4.1 4 11565 3077
5 Fair D 1.01 SI2 64.6 56 3003 6.31 6.24 4.05 5 14568 3077
6 Fair D 0.73 VS1 66 54 3047 5.56 5.66 3.7 6 17615 3077
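The cumprice[rn == 7] - cumprice[rn == 6] indexing that the answers rely on works because each side is a length-one vector, and dplyr recycles the resulting scalar across the whole group. A base-R illustration on a single group's values (the Fair/D cumprice numbers from above):

```r
rn       <- 1:10
cumprice <- c(2848, 5706, 8591, 11565, 14568, 17615, 20692, 23771, 26976, 30181)

# both subsets have length 1, so the difference is a single number ...
delta <- cumprice[rn == 7] - cumprice[rn == 6]
delta
#> [1] 3077

# ... which can be recycled to every row of the group, as mutate() does
rep(delta, length(rn))
```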

mutate data list column in purrr block and get a static min by grouping numeric variable

Using diamonds as an example, I'd like to group by cut, then add a row number to each group and then shuffle. Then I'd like to apply a transformation to price, in this case just price + 1, and then find the price corresponding to row 1 and make that the value for the entire feature.
Tried:
mydiamonds <- diamonds %>%
  group_by(cut) %>%
  mutate(rownum = row_number()) %>%
  nest %>%
  mutate(data = map(data, ~ .x %>% sample_n(nrow(.x)))) %>%
  mutate(data = map(data, ~ .x %>% mutate(InitialPrice = price + rownum)))
This gets me close:
mydiamonds$data[[1]] %>% head
# A tibble: 6 x 11
carat color clarity depth table price x y z rownum InitialPrice
<dbl> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int> <int>
1 0.4 E VS1 62.4 54 951 4.73 4.75 2.96 13792 14743
2 0.71 H VS2 60.9 55 2450 5.76 5.74 3.5 20808 23258
3 1.01 F VVS2 61 57 8688 6.52 6.46 3.96 6567 15255
4 0.62 G VS2 61.6 55 2321 5.51 5.53 3.4 20438 22759
5 0.77 F VS1 60.9 58 3655 5.91 5.95 3.61 1717 5372
6 1.37 G VVS2 62.3 55.5 12207 7.05 7.14 4.43 8013 20220
What I'd like to do from here is find the value of InitialPrice corresponding to rownum == 1 and then overwrite InitialPrice with that single value all the way down for each data frame in mydiamonds$data.
I tried mutating again at the end of my pipeline like so:
mutate(data = map(data, ~ .x %>% mutate(InitialPrice = price + rownum) %>% mutate(InitialPrice = . %>% filter(rownum == 1) %>% pull(InitialPrice))))
However, I got this error:
Error: Problem with mutate() input data.
x Problem with mutate() input InitialPrice.
x Input InitialPrice must be a vector, not a fseq/function object.
ℹ Input InitialPrice is . %>% filter(rownum == 1) %>% pull(InitialPrice).
ℹ Input data is map(...).
How could I do that?
We could wrap the . within braces
library(dplyr)
library(ggplot2)
library(purrr)
mydiamonds %>%
  mutate(data = map(data, ~ .x %>%
                      mutate(InitialPrice = price + rownum) %>%
                      mutate(InitialPrice = {.} %>%
                               filter(rownum == 1) %>%
                               pull(InitialPrice))))
# A tibble: 5 x 2
# Groups: cut [5]
# cut data
# <ord> <list>
#1 Ideal <tibble [21,551 × 11]>
#2 Premium <tibble [13,791 × 11]>
#3 Good <tibble [4,906 × 11]>
#4 Very Good <tibble [12,082 × 11]>
#5 Fair <tibble [1,610 × 11]>
You can do:
library(tidyverse)
result <- mydiamonds %>%
  mutate(data = map(data, ~ .x %>%
                      mutate(InitialPrice = InitialPrice[rownum == 1])))

result$data[[1]]
# A tibble: 21,551 x 11
# carat color clarity depth table price x y z rownum InitialPrice
# <dbl> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int> <int>
# 1 0.7 I VVS1 61.8 56 2492 5.72 5.74 3.54 20897 327
# 2 0.51 G VS1 61.8 60 1757 5.08 5.12 3.15 18405 327
# 3 0.32 G VVS1 61.4 57 814 4.39 4.41 2.7 11820 327
# 4 0.33 H VVS1 62.5 56 901 4.44 4.42 2.77 13130 327
# 5 0.72 G SI2 62.1 54 2079 5.77 5.82 3.6 19769 327
# 6 1.31 G VVS2 59.2 59 11459 7.12 7.18 4.23 7807 327
# 7 0.32 F VVS2 61.6 55 945 4.41 4.42 2.72 13714 327
# 8 0.39 G VVS1 62.1 54.7 1008 4.64 4.72 2.91 14462 327
# 9 0.7 E VVS2 62.3 53.7 3990 5.67 5.72 3.55 2138 327
#10 0.71 D SI2 62.7 55 2551 5.67 5.71 3.57 21042 327
# … with 21,541 more rows
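The InitialPrice[rownum == 1] idiom in that last answer relies on the same scalar recycling: subsetting with a condition that matches exactly one row yields a length-one value, which R then recycles to fill the whole column. A toy base-R version (hypothetical numbers, just for illustration):

```r
df <- data.frame(rownum = c(3, 1, 2), InitialPrice = c(30, 10, 20))

# take the value where rownum == 1 and write it into every row
df$InitialPrice <- df$InitialPrice[df$rownum == 1]

df$InitialPrice
#> [1] 10 10 10
```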

Combining filter, across, and starts_with to string search across columns in R

This is very similar to the answer given here, but I cannot figure out why starts_with does not work:
diamonds %>%
  filter(across(clarity, ~ grepl('^S', .))) %>%
  head
# A tibble: 6 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
4 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
5 0.3 Good J SI1 64 55 339 4.25 4.28 2.73
6 0.22 Premium F SI1 60.4 61 342 3.88 3.84 2.33
diamonds %>%
  filter(across(starts_with("c"), ~ grepl("^S", .))) %>%
  head
# A tibble: 0 x 10
# ... with 10 variables: carat <dbl>, cut <ord>, color <ord>, clarity <ord>, depth <dbl>, table <dbl>,
# price <int>, x <dbl>, y <dbl>, z <dbl>
dplyr before 1.0.4
diamonds %>%
  filter(rowSums(across(starts_with("c"), ~ grepl("^S", .))) > 0)
# A tibble: 22,259 x 10
# carat cut color clarity depth table price x y z
# <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
# 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
# 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
# 3 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
# 4 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
# 5 0.3 Good J SI1 64 55 339 4.25 4.28 2.73
# 6 0.22 Premium F SI1 60.4 61 342 3.88 3.84 2.33
# 7 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71
# 8 0.2 Premium E SI2 60.2 62 345 3.79 3.75 2.27
# 9 0.3 Ideal I SI2 62 54 348 4.31 4.34 2.68
# 10 0.3 Good J SI1 63.4 54 351 4.23 4.29 2.7
# # ... with 22,249 more rows
How to figure this out or confirm it:
diamonds %>%
  filter({ browser(); across(starts_with("c"), ~ grepl("^S", .)); })
# Called from: mask$eval_all_filter(dots, env_filter)
# debug at #1: across(starts_with("c"), ~grepl("^S", .))
across(starts_with("c"), ~ grepl("^S" , .))
# # A tibble: 53,940 x 4
# carat cut color clarity
# <lgl> <lgl> <lgl> <lgl>
# 1 FALSE FALSE FALSE TRUE
# 2 FALSE FALSE FALSE TRUE
# 3 FALSE FALSE FALSE FALSE
# 4 FALSE FALSE FALSE FALSE
# 5 FALSE FALSE FALSE TRUE
# 6 FALSE FALSE FALSE FALSE
# 7 FALSE FALSE FALSE FALSE
# 8 FALSE FALSE FALSE TRUE
# 9 FALSE FALSE FALSE FALSE
# 10 FALSE FALSE FALSE FALSE
# # ... with 53,930 more rows
To me, it seems apparent that one would want any row with at least one TRUE (or perhaps all, but I'll assume "any" for now). Since this is a frame of logicals, we can use rowSums, which sums FALSE as 0 and TRUE as 1, so
head(rowSums(across(starts_with("c"), ~ grepl("^S" , .))) > 0)
# [1] TRUE TRUE FALSE FALSE TRUE FALSE
which is a single vector of logicals, one per row, which is what dplyr::filter ultimately wants/needs.
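The logical-to-numeric coercion this relies on (FALSE counts as 0, TRUE as 1) can be seen directly in base R on a tiny logical matrix (made-up values mirroring the first two rows above):

```r
m <- cbind(carat = c(FALSE, FALSE), cut = c(FALSE, FALSE), clarity = c(TRUE, FALSE))

rowSums(m)       # counts the TRUEs in each row
#> [1] 1 0

rowSums(m) > 0   # TRUE if any column matched, i.e. row-wise "any"
#> [1]  TRUE FALSE
```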
dplyr since 1.0.4
See https://www.tidyverse.org/blog/2021/02/dplyr-1-0-4-if-any/
diamonds %>%
  filter(if_any(starts_with("c"), ~ grepl("^S", .)))

Joining various summaries in dplyr

I have dozens of variables that I need to operate on by group, with different instructions depending on the variable, usually based on its name, plus a few ad hoc changes and renaming here and there.
A reprex using a modified diamonds dataset for illustration is below:
library(tidyverse)
diamond_renamed <- diamonds %>%
  rename(size_x = x, size_y = y, size_z = z) %>%
  rename(val_1 = depth, val_2 = table)

diamond_summary <- bind_cols(
  diamond_renamed %>%
    group_by(cut, color, clarity) %>%
    summarise(cost = sum(price)),
  diamond_renamed %>%
    group_by(cut, color, clarity) %>%
    summarise_at(vars(contains("size")), funs(median(.))),
  diamond_renamed %>%
    group_by(cut, color, clarity) %>%
    summarise_at(vars(contains("val")), funs(mean(.)))
)
diamond_summary
#> # A tibble: 276 x 15
#> # Groups: cut, color [?]
#> cut color clarity cost cut1 color1 clarity1 size_x size_y size_z
#> <ord> <ord> <ord> <int> <ord> <ord> <ord> <dbl> <dbl> <dbl>
#> 1 Fair D I1 29532 Fair D I1 7.32 7.20 4.70
#> 2 Fair D SI2 243888 Fair D SI2 6.13 6.06 3.99
#> 3 Fair D SI1 247854 Fair D SI1 6.08 6.04 3.93
#> 4 Fair D VS2 112822 Fair D VS2 6.04 6 3.65
#> 5 Fair D VS1 14606 Fair D VS1 5.56 5.58 3.66
#> 6 Fair D VVS2 32463 Fair D VVS2 4.95 4.84 3.31
#> 7 Fair D VVS1 13419 Fair D VVS1 4.92 5.03 3.28
#> 8 Fair D IF 4859 Fair D IF 4.68 4.73 2.88
#> 9 Fair E I1 18857 Fair E I1 6.18 6.14 4.03
#> 10 Fair E SI2 325446 Fair E SI2 6.28 6.20 3.95
#> # ... with 266 more rows, and 5 more variables: cut2 <ord>, color2 <ord>,
#> # clarity2 <ord>, val_1 <dbl>, val_2 <dbl>
This yields the desired result: a dataset with the grouped summaries... but it also repeats the grouping variables. It's also not great to have to repeat the group_by code itself every time, but I'm not sure how else to do it. It may also not be the most efficient use of summarise. How can we avoid that repetition and make this code better?
Thank you!
One option would be to mutate instead of summarize in the initial steps and add those columns in the group_by
diamond_renamed %>%
  group_by(cut, color, clarity) %>%
  group_by(cost = sum(price), add = TRUE) %>%
  mutate_at(vars(contains("size")), median) %>%
  group_by_at(vars(contains("size")), .add = TRUE) %>%
  summarise_at(vars(contains("val")), mean)
# A tibble: 276 x 9
# Groups: cut, color, clarity, cost, size_x, size_y [?]
# cut color clarity cost size_x size_y size_z val_1 val_2
# <ord> <ord> <ord> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 Fair D I1 29532 7.32 7.20 4.70 65.6 56.8
# 2 Fair D SI2 243888 6.13 6.06 3.99 64.7 58.6
# 3 Fair D SI1 247854 6.08 6.04 3.93 64.6 58.8
# 4 Fair D VS2 112822 6.04 6 3.65 62.7 60.3
# 5 Fair D VS1 14606 5.56 5.58 3.66 63.2 57.8
# 6 Fair D VVS2 32463 4.95 4.84 3.31 61.7 58.8
# 7 Fair D VVS1 13419 4.92 5.03 3.28 61.7 64.3
# 8 Fair D IF 4859 4.68 4.73 2.88 60.8 58
# 9 Fair E I1 18857 6.18 6.14 4.03 65.6 58.1
#10 Fair E SI2 325446 6.28 6.20 3.95 63.4 59.5
# ... with 266 more rows
NOTE: The grouping columns 'cut', 'color', 'clarity' are not repeated here as in the OP's post. So, it is only 9 columns instead of 15