I have dozens of variables that I need to operate on by group, with different instructions to be done depending on the variable, usually as per the name of the variable, with a few ad hoc changes and renaming here and there.
A reprex using a modified diamonds dataset for illustration is below:
library(tidyverse)
diamond_renamed <- diamonds %>%
rename(size_x = x, size_y = y, size_z = z) %>%
rename(val_1 = depth, val_2 = table)
diamond_summary <- bind_cols(diamond_renamed %>%
group_by(cut, color, clarity) %>%
summarise(
cost = sum(price)
),
diamond_renamed %>%
group_by(cut, color, clarity) %>%
summarise_at(
vars(contains("size")),
funs(median(.))
),
diamond_renamed %>%
group_by(cut, color, clarity) %>%
summarise_at(
vars(contains("val")),
funs(mean(.))
)
)
diamond_summary
#> # A tibble: 276 x 15
#> # Groups: cut, color [?]
#> cut color clarity cost cut1 color1 clarity1 size_x size_y size_z
#> <ord> <ord> <ord> <int> <ord> <ord> <ord> <dbl> <dbl> <dbl>
#> 1 Fair D I1 29532 Fair D I1 7.32 7.20 4.70
#> 2 Fair D SI2 243888 Fair D SI2 6.13 6.06 3.99
#> 3 Fair D SI1 247854 Fair D SI1 6.08 6.04 3.93
#> 4 Fair D VS2 112822 Fair D VS2 6.04 6 3.65
#> 5 Fair D VS1 14606 Fair D VS1 5.56 5.58 3.66
#> 6 Fair D VVS2 32463 Fair D VVS2 4.95 4.84 3.31
#> 7 Fair D VVS1 13419 Fair D VVS1 4.92 5.03 3.28
#> 8 Fair D IF 4859 Fair D IF 4.68 4.73 2.88
#> 9 Fair E I1 18857 Fair E I1 6.18 6.14 4.03
#> 10 Fair E SI2 325446 Fair E SI2 6.28 6.20 3.95
#> # ... with 266 more rows, and 5 more variables: cut2 <ord>, color2 <ord>,
#> # clarity2 <ord>, val_1 <dbl>, val_2 <dbl>
This yields the desired result: a dataset with the grouped summaries... but it also repeats the grouping variables. It's also not great to have to repeat the group_by code itself every time... but I'm not sure how else to do it. It may also not be the most efficient use of summarise. How can we avoid that repetition and make this code better?
Thank you!
One option would be to mutate instead of summarize in the initial steps and add those columns in the group_by
diamond_renamed %>%
group_by(cut, color, clarity) %>%
group_by(cost = sum(price), .add = TRUE) %>%
mutate_at(vars(contains("size")), median) %>%
group_by_at(vars(contains("size")), .add = TRUE) %>%
summarise_at(vars(contains("val")), mean)
# A tibble: 276 x 9
# Groups: cut, color, clarity, cost, size_x, size_y [?]
# cut color clarity cost size_x size_y size_z val_1 val_2
# <ord> <ord> <ord> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 Fair D I1 29532 7.32 7.20 4.70 65.6 56.8
# 2 Fair D SI2 243888 6.13 6.06 3.99 64.7 58.6
# 3 Fair D SI1 247854 6.08 6.04 3.93 64.6 58.8
# 4 Fair D VS2 112822 6.04 6 3.65 62.7 60.3
# 5 Fair D VS1 14606 5.56 5.58 3.66 63.2 57.8
# 6 Fair D VVS2 32463 4.95 4.84 3.31 61.7 58.8
# 7 Fair D VVS1 13419 4.92 5.03 3.28 61.7 64.3
# 8 Fair D IF 4859 4.68 4.73 2.88 60.8 58
# 9 Fair E I1 18857 6.18 6.14 4.03 65.6 58.1
#10 Fair E SI2 325446 6.28 6.20 3.95 63.4 59.5
# ... with 266 more rows
NOTE: The grouping columns 'cut', 'color', 'clarity' are not repeated here as in the OP's post. So, it is only 9 columns instead of 15
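With dplyr 1.0.0 or later (an assumption about the installed version), across() lets a single summarise() apply a different function to each family of columns, so the grouping is written only once. A minimal sketch, with the two rename() calls merged into one:

```r
library(dplyr)
library(ggplot2)  # only needed for the diamonds dataset

diamond_renamed <- diamonds %>%
  rename(size_x = x, size_y = y, size_z = z, val_1 = depth, val_2 = table)

diamond_summary <- diamond_renamed %>%
  group_by(cut, color, clarity) %>%
  summarise(
    cost = sum(price),                  # ad hoc summary of a single column
    across(contains("size"), median),   # median for every "size" column
    across(contains("val"), mean),      # mean for every "val" column
    .groups = "drop"                    # return an ungrouped tibble
  )

dim(diamond_summary)
```

Further ad hoc summaries can be added as extra named arguments in the same summarise() call.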
Related
A grouped data frame:
grp_diamonds <- diamonds %>%
group_by(cut, color) %>%
mutate(rn = row_number()) %>%
arrange(cut, color, rn) %>%
mutate(cumprice = cumsum(price))
Looks like:
grp_diamonds
# A tibble: 53,940 × 12
# Groups: cut, color [35]
carat cut color clarity depth table price x y z rn cumprice
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int> <int>
1 0.75 Fair D SI2 64.6 57 2848 5.74 5.72 3.7 1 2848
2 0.71 Fair D VS2 56.9 65 2858 5.89 5.84 3.34 2 5706
3 0.9 Fair D SI2 66.9 57 2885 6.02 5.9 3.99 3 8591
4 1 Fair D SI2 69.3 58 2974 5.96 5.87 4.1 4 11565
5 1.01 Fair D SI2 64.6 56 3003 6.31 6.24 4.05 5 14568
6 0.73 Fair D VS1 66 54 3047 5.56 5.66 3.7 6 17615
7 0.71 Fair D VS2 64.7 58 3077 5.61 5.58 3.62 7 20692
8 0.91 Fair D SI2 62.5 66 3079 6.08 6.01 3.78 8 23771
9 0.9 Fair D SI2 65.9 59 3205 6 5.95 3.94 9 26976
10 0.9 Fair D SI2 66 58 3205 6 5.97 3.95 10 30181
Within each group, I would like to add a new field 'GROWTH_6_7' which is the delta between cumprice at rn = 7 - rn = 6.
Read documentation and tried and failed using cur_data() with mutate. Maybe that's the right path or maybe there's a 'better' way?
How can I mutate a new field within each group 'GROWTH_6_7' that is the difference between rn == 7 and rn ==6 cumprice?
We could do this within mutate itself:
library(dplyr)
grp_diamonds %>%
group_by(cut, color) %>%
mutate(GROWTH_6_7 = cumprice[rn == 7] - cumprice[rn == 6])
-output
# A tibble: 53,940 x 13
# Groups: cut, color [35]
carat cut color clarity depth table price x y z rn cumprice GROWTH_6_7
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int>
1 0.75 Fair D SI2 64.6 57 2848 5.74 5.72 3.7 1 2848 3077
2 0.71 Fair D VS2 56.9 65 2858 5.89 5.84 3.34 2 5706 3077
3 0.9 Fair D SI2 66.9 57 2885 6.02 5.9 3.99 3 8591 3077
4 1 Fair D SI2 69.3 58 2974 5.96 5.87 4.1 4 11565 3077
5 1.01 Fair D SI2 64.6 56 3003 6.31 6.24 4.05 5 14568 3077
6 0.73 Fair D VS1 66 54 3047 5.56 5.66 3.7 6 17615 3077
7 0.71 Fair D VS2 64.7 58 3077 5.61 5.58 3.62 7 20692 3077
8 0.91 Fair D SI2 62.5 66 3079 6.08 6.01 3.78 8 23771 3077
9 0.9 Fair D SI2 65.9 59 3205 6 5.95 3.94 9 26976 3077
10 0.9 Fair D SI2 66 58 3205 6 5.97 3.95 10 30181 3077
# … with 53,930 more rows
If there are cases where there are some missing values, then another option is pivot_wider
library(tidyr)
grp_diamonds %>%
ungroup %>%
select(cut, color, rn, cumprice) %>%
filter(rn %in% 6:7) %>%
pivot_wider(names_from = rn, values_from = cumprice) %>%
transmute(cut, color, GROWTH_6_7 = `7` - `6`) %>%
left_join(grp_diamonds, .)
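If a group has no row with rn == 6 or rn == 7, the logical subsetting in the mutate version returns a zero-length vector and mutate fails. One hedged alternative is to look the row up with match(), which yields NA for such groups. The toy data and the column g below are made up for illustration:

```r
library(dplyr)

# Hypothetical toy data: group "a" has 7 rows, group "b" only 3.
toy <- tibble(
  g        = c(rep("a", 7), rep("b", 3)),
  rn       = c(1:7, 1:3),
  cumprice = c(cumsum(1:7), cumsum(c(10, 20, 30)))
)

toy_growth <- toy %>%
  group_by(g) %>%
  # match() returns NA when the row number is absent, so the result is NA
  mutate(GROWTH_6_7 = cumprice[match(7, rn)] - cumprice[match(6, rn)]) %>%
  ungroup()

toy_growth
```

Group "a" gets 28 - 21 = 7 on every row, and group "b" gets NA.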
GrowDelta <- function(data, start_row = 6, end_row = 7){
data$cumprice[end_row] - data$cumprice[start_row]
}
grp_diamonds %>%
summarize(GROWTH_6_7 = GrowDelta(cur_data()))
mutate instead of summarize should work, too. It will just repeat it for every row in the group instead of just once for each group, i.e. will result in tibble with the same number of rows as the data set. Using summarize will give you a 35 x 3 tibble.
You may try group_modify:
Code
grow <- grp_diamonds %>%
group_by(cut, color) %>%
group_modify(~{
.x %>%
mutate(GROWTH_6_7 = .x$cumprice[.x$rn == 7] - .x$cumprice[.x$rn == 6])
})
Output
> head(grow)
# A tibble: 6 x 13
# Groups: cut, color [1]
cut color carat clarity depth table price x y z rn cumprice GROWTH_6_7
<ord> <ord> <dbl> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int>
1 Fair D 0.75 SI2 64.6 57 2848 5.74 5.72 3.7 1 2848 3077
2 Fair D 0.71 VS2 56.9 65 2858 5.89 5.84 3.34 2 5706 3077
3 Fair D 0.9 SI2 66.9 57 2885 6.02 5.9 3.99 3 8591 3077
4 Fair D 1 SI2 69.3 58 2974 5.96 5.87 4.1 4 11565 3077
5 Fair D 1.01 SI2 64.6 56 3003 6.31 6.24 4.05 5 14568 3077
6 Fair D 0.73 VS1 66 54 3047 5.56 5.66 3.7 6 17615 3077
A list of data frames:
mylist <- diamonds %>%
mutate(somenum = rnorm(nrow(.))) %>%
group_by(cut, color) %>%
group_split %>%
map(~ list(dta = ., initial_val = rnorm(1)))
E.g. the first item in mylist:
mylist[1]
[[1]]
[[1]]$dta
# A tibble: 163 x 11
carat cut color clarity depth table price x y z somenum
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 0.75 Fair D SI2 64.6 57 2848 5.74 5.72 3.7 0.302
2 0.71 Fair D VS2 56.9 65 2858 5.89 5.84 3.34 0.500
3 0.9 Fair D SI2 66.9 57 2885 6.02 5.9 3.99 0.179
4 1 Fair D SI2 69.3 58 2974 5.96 5.87 4.1 1.25
5 1.01 Fair D SI2 64.6 56 3003 6.31 6.24 4.05 -0.731
6 0.73 Fair D VS1 66 54 3047 5.56 5.66 3.7 0.758
7 0.71 Fair D VS2 64.7 58 3077 5.61 5.58 3.62 -1.43
8 0.91 Fair D SI2 62.5 66 3079 6.08 6.01 3.78 0.820
9 0.9 Fair D SI2 65.9 59 3205 6 5.95 3.94 -1.81
10 0.9 Fair D SI2 66 58 3205 6 5.97 3.95 -0.179
# … with 153 more rows
[[1]]$initial_val
[1] 1.788348
The list item contains a data frame as well as a number 'initial_val'.
For each data frame in the list, I would like to mutate onto it a new field 'cumsum_someval' that starts with the initial_val for the list item and then 'builds it up' by adding the previous row's entry for somenum to the running total. E.g. in the first row, the value of cumsum_someval will just be the initial_val, 1.788348. But the second row of cumsum_someval should be 1.788348 + 0.302 = 2.090348. Then the 3rd row would be 2.090348 + 0.500 = 2.590348. And so on.
Perhaps something like purrr::map with a custom func along the lines...?
myfun <- function(dta, initial_val) {
cum_val = initial_val + dta$somenum[<rownumber here>]
}
Open to suggestions.
For each data frame in mylist, how can I build up this new mutated field that starts with initial_val and then progresses to sum the lag of each instance of somenum?
We loop over the list with map, extract the 'dta' dataset, and create the new 'cum_val' by taking the lag of 'somenum' with 'initial_val' as the default and then applying cumsum. We assign the output back to 'dta' and return the whole list element, i.e. .x
library(purrr)
library(dplyr)
map(mylist, ~ {
.x$dta <- .x$dta %>%
mutate(cum_val = cumsum(lag(somenum, default = .x$initial_val)))
.x
})
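To see why lag plus cumsum gives the requested series, here is a small check using the numbers quoted in the question. The base R line with head() is an equivalent formulation added for comparison:

```r
library(dplyr)

initial_val <- 1.788348
somenum     <- c(0.302, 0.500, 0.179)

# lag() shifts somenum down one row and puts initial_val in the first slot,
# so the cumulative sum starts at initial_val and adds each previous somenum.
via_lag  <- cumsum(dplyr::lag(somenum, default = initial_val))

# Same series in base R: prepend initial_val and drop the last somenum.
via_head <- cumsum(c(initial_val, head(somenum, -1)))

via_lag
```

Both vectors are 1.788348, 2.090348, 2.590348, which matches the values worked out by hand in the question.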
Although this is not as elegant as Arun's solution, I will dedicate it to him and also to Doug for presenting us with very good challenges every once in a while:
library(purrr)
mylist %>%
map_dbl("initial_val") %>%
map2(mylist, function(a, b) {
b %>% imap(~ if(.y == "dta") {
.x %>% mutate(cumsum_someval = accumulate(c(a, .x$somenum[-nrow(.x)]), `+`))
} else {
.x
})
}) %>% `[`(1)
[[1]]
[[1]]$dta
# A tibble: 163 x 12
carat cut color clarity depth table price x y z somenum cumsum_someval
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.75 Fair D SI2 64.6 57 2848 5.74 5.72 3.7 0.0684 0.0391
2 0.71 Fair D VS2 56.9 65 2858 5.89 5.84 3.34 0.436 0.108
3 0.9 Fair D SI2 66.9 57 2885 6.02 5.9 3.99 -0.0591 0.543
4 1 Fair D SI2 69.3 58 2974 5.96 5.87 4.1 1.08 0.484
5 1.01 Fair D SI2 64.6 56 3003 6.31 6.24 4.05 -0.478 1.57
6 0.73 Fair D VS1 66 54 3047 5.56 5.66 3.7 -0.600 1.09
7 0.71 Fair D VS2 64.7 58 3077 5.61 5.58 3.62 -0.825 0.487
8 0.91 Fair D SI2 62.5 66 3079 6.08 6.01 3.78 -1.09 -0.338
9 0.9 Fair D SI2 65.9 59 3205 6 5.95 3.94 -0.672 -1.42
10 0.9 Fair D SI2 66 58 3205 6 5.97 3.95 -0.273 -2.10
# ... with 153 more rows
[[1]]$initial_val
[1] 0.03913573
Using diamonds as an example, I'd like to group by cut then add a row number for each grouping and then shuffle. Then I'd like to apply a transformation to price, in this case just price + 1 and then I'd like to find the price corresponding to row 1 and make that the value for the entire feature.
Tried:
mydiamonds <- diamonds %>%
group_by(cut) %>%
mutate(rownum = row_number()) %>%
nest %>%
mutate(data = map(data, ~ .x %>% sample_n(nrow(.x)))) %>%
mutate(data = map(data, ~ .x %>% mutate(InitialPrice = price + rownum)))
This gets me close:
mydiamonds$data[[1]] %>% head
# A tibble: 6 x 11
carat color clarity depth table price x y z rownum InitialPrice
<dbl> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int> <int>
1 0.4 E VS1 62.4 54 951 4.73 4.75 2.96 13792 14743
2 0.71 H VS2 60.9 55 2450 5.76 5.74 3.5 20808 23258
3 1.01 F VVS2 61 57 8688 6.52 6.46 3.96 6567 15255
4 0.62 G VS2 61.6 55 2321 5.51 5.53 3.4 20438 22759
5 0.77 F VS1 60.9 58 3655 5.91 5.95 3.61 1717 5372
6 1.37 G VVS2 62.3 55.5 12207 7.05 7.14 4.43 8013 20220
What I'd like to do from here is to find the value of InitialPrice corresponding to rownum == 1 and then overwrite InitialPrice to be that single value all the way down for each data frame in mydiamonds$data.
I tried mutating and mutating again in my last line like so:
mutate(data = map(data, ~ .x %>% mutate(InitialPrice = price + rownum) %>% mutate(InitialPrice = . %>% filter(rownum ==1) %>% pull(InitialPrice))))
However got error:
Error: Problem with mutate() input data.
x Problem with mutate() input InitialPrice.
x Input InitialPrice must be a vector, not a fseq/function object.
ℹ Input InitialPrice is . %>% filter(rownum == 1) %>% pull(InitialPrice).
ℹ Input data is map(...).
How could I do that?
We could wrap the . within braces
library(dplyr)
library(ggplot2)
library(purrr)
mydiamonds %>%
mutate(data = map(data, ~ .x %>%
mutate(InitialPrice = price + rownum ) %>%
mutate(InitialPrice = {.} %>%
filter(rownum ==1) %>%
pull(InitialPrice))))
# A tibble: 5 x 2
# Groups: cut [5]
# cut data
# <ord> <list>
#1 Ideal <tibble [21,551 × 11]>
#2 Premium <tibble [13,791 × 11]>
#3 Good <tibble [4,906 × 11]>
#4 Very Good <tibble [12,082 × 11]>
#5 Fair <tibble [1,610 × 11]>
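The original error arises because, in magrittr, a pipeline whose left-hand side is a bare . does not run anything: it defines a function (a "functional sequence"). A small sketch of that behaviour, independent of the diamonds data; the name root_of_sum is made up for illustration:

```r
library(magrittr)

# A pipeline that starts with a bare `.` builds a function instead of running.
root_of_sum <- . %>% sum() %>% sqrt()

is.function(root_of_sum)   # TRUE
root_of_sum(c(9, 16))      # sqrt(9 + 16) = 5
```

Writing {.} instead turns the left-hand side into an ordinary expression, so the pipeline is evaluated immediately and mutate receives a vector.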
You can do:
library(tidyverse)
result <- mydiamonds %>%
mutate(data = map(data, ~.x %>%
mutate(InitialPrice = InitialPrice[rownum == 1])))
result$data[[1]]
# A tibble: 21,551 x 11
# carat color clarity depth table price x y z rownum InitialPrice
# <dbl> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int> <int>
# 1 0.7 I VVS1 61.8 56 2492 5.72 5.74 3.54 20897 327
# 2 0.51 G VS1 61.8 60 1757 5.08 5.12 3.15 18405 327
# 3 0.32 G VVS1 61.4 57 814 4.39 4.41 2.7 11820 327
# 4 0.33 H VVS1 62.5 56 901 4.44 4.42 2.77 13130 327
# 5 0.72 G SI2 62.1 54 2079 5.77 5.82 3.6 19769 327
# 6 1.31 G VVS2 59.2 59 11459 7.12 7.18 4.23 7807 327
# 7 0.32 F VVS2 61.6 55 945 4.41 4.42 2.72 13714 327
# 8 0.39 G VVS1 62.1 54.7 1008 4.64 4.72 2.91 14462 327
# 9 0.7 E VVS2 62.3 53.7 3990 5.67 5.72 3.55 2138 327
#10 0.71 D SI2 62.7 55 2551 5.67 5.71 3.57 21042 327
# … with 21,541 more rows
Using the diamonds dataset (from the ggplot2 library) as an example, I am trying to subset this table by columns and rows based on a vector of named elements (the names of the vector should be used to subset by columns and the corresponding vector elements by rows).
library(ggplot2)
diamonds
# A tibble: 53,940 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# … with 53,930 more rows
myVector <- c(cut="Ideal", cut="Good", color="E", color="J")
myVector
cut cut color color
"Ideal" "Good" "E" "J"
What I intend to do, would be something like follows but using myVector:
library(dplyr)
diamonds %>% subset(., (cut=="Ideal" | cut=="Good") & (color=="E" | color=="J")) %>%
select(cut, color)
Starting with the split idea of ThomasIsCoding, slightly changed, here is a base R solution that uses Reduce/Map to create a logical index.
v <- split(unname(myVector), names(myVector))
i <- Reduce('&', Map(function(x, y){x %in% y}, diamonds[names(v)], v))
diamonds[i, ]
## A tibble: 6,039 x 10
# carat cut color clarity depth table price x y z
# <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
# 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
# 2 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
# 3 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
# 4 0.3 Good J SI1 64 55 339 4.25 4.28 2.73
# 5 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46
# 6 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71
# 7 0.3 Good J SI1 63.4 54 351 4.23 4.29 2.7
# 8 0.3 Good J SI1 63.8 56 351 4.23 4.26 2.71
# 9 0.23 Good E VS1 64.1 59 402 3.83 3.85 2.46
#10 0.33 Ideal J SI1 61.1 56 403 4.49 4.55 2.76
## ... with 6,029 more rows
Package dplyr
The code above can be written as a function and used in dplyr::filter.
# Input:
# X - a data set to be filtered
# values - a named list
values_in <- function(X, values){
v <- split(unname(values), names(values))
i <- Reduce('&', Map(function(x, y){x %in% y}, X[names(v)], v))
i
}
diamonds %>% filter( values_in(., myVector) )
The output is the same as above and is therefore omitted.
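To make the split/Map/Reduce steps concrete, here is the same logic on a four-row data frame (toy is a made-up example, not part of diamonds):

```r
myVector <- c(cut = "Ideal", cut = "Good", color = "E", color = "J")

# Hypothetical toy data with the two columns named in myVector.
toy <- data.frame(
  cut   = c("Ideal", "Fair", "Good", "Ideal"),
  color = c("E",     "E",    "J",    "D")
)

# One list element per column name: v$color is E, J; v$cut is Ideal, Good.
v <- split(unname(myVector), names(myVector))

# One logical vector per column, then combine them with AND.
per_column <- Map(function(x, y) x %in% y, toy[names(v)], v)
i <- Reduce(`&`, per_column)

toy[i, ]
```

Rows 1 and 3 are kept: row 2 fails on cut and row 4 fails on color.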
I am not sure if you want something like below
u <- split(myVector,names(myVector))
eval(str2expression(sprintf("diamonds %%>%% filter(%s)",paste0(sapply(names(u),function(x) paste0(x," %in% u$",x)),collapse = " & "))))
such that
> eval(str2expression(sprintf("diamonds %%>%% filter(%s)",paste0(sapply(names(u),function(x) paste0(x," %in% u$",x)),collapse = " & "))))
# A tibble: 6,039 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
3 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
4 0.3 Good J SI1 64 55 339 4.25 4.28 2.73
5 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46
6 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71
7 0.3 Good J SI1 63.4 54 351 4.23 4.29 2.7
8 0.3 Good J SI1 63.8 56 351 4.23 4.26 2.71
9 0.23 Good E VS1 64.1 59 402 3.83 3.85 2.46
10 0.33 Ideal J SI1 61.1 56 403 4.49 4.55 2.76
# ... with 6,029 more rows
Combining the approaches proposed by @Roman (generating all combinations of the vector elements and joining) and @ThomasIsCoding (splitting the vector) seems to do the trick:
data.frame(split(myVector, names(myVector))) %>%
expand.grid() %>%
inner_join(diamonds[,unique(names(myVector))])
you can try
my_vec_cut = myVector[names(myVector) == "cut"]
my_vec_color = myVector[names(myVector) == "color"]
I split the vector in two, since you filter on two columns using both AND and OR
diamonds %>%
filter(.data[[unique(names(my_vec_cut))]] %in% my_vec_cut & .data[[unique(names(my_vec_color))]] %in% my_vec_color)
A general way would be a joining approach. First you build all required combinations from your vector, then you left join the data.
library(tidyverse)
tibble(a=names(myVector), b=myVector) %>%
group_by(a) %>%
mutate(n=1:n()) %>%
pivot_wider(names_from = a, values_from=b) %>%
select(-n) %>%
complete(cut, color)
# A tibble: 4 x 2
cut color
<chr> <chr>
1 Good E
2 Good J
3 Ideal E
4 Ideal J
# now left_joining:
tibble(a=names(myVector), b=myVector) %>%
group_by(a) %>%
mutate(n=1:n()) %>%
pivot_wider(names_from = a, values_from=b) %>%
select(-n) %>%
complete(cut, color) %>%
left_join(diamonds) %>%
count(cut, color)
Similar idea to @ThomasIsCoding's, just in base R.
al <- split(myVector, names(myVector))
res <- with(diamonds, diamonds[eval(parse(text=paste(sapply(names(al), function(x)
paste0(x, " %in% ", "al[['", x, "']]")), collapse=" & "))), ])
unique(res$cut)
# [1] Ideal Good
# Levels: Fair < Good < Very Good < Premium < Ideal
unique(res$color)
# [1] E J
# Levels: D < E < F < G < H < I < J
If you don't use the vector which has characters (and not expressions) as names, it gets a lot easier and maybe more readable:
library(ggplot2)
library(tidyverse)
library(rlang)
my_filter <- function(d, x, selection) {
cmd <- map2(x, selection, ~quo(`%in%`(!!.x, !!.y))) # create filter expression
d %>%
filter(!!!cmd) %>% # filter
select(!!!x) # select columns cut and color (in this case)
}
diamonds %>%
my_filter(x = vars(cut, color),
selection = list(c("Ideal", "Good"), c("E", "J")))
# # A tibble: 6,039 x 2
# cut color
# <ord> <ord>
# 1 Ideal E
# 2 Good E
# 3 Good J
# 4 Good J
# 5 Ideal J
# 6 Ideal J
# 7 Good J
# 8 Good J
# 9 Good E
# 10 Ideal J
# # ... with 6,029 more rows
I would like to group the rows of a sparklyr table by a specific column, and count the rows that fulfil a specific criterion.
For example, in the following diamonds table, I would like to group_by color, and count the number of rows with price >400.
> library(sparklyr)
> library(tidyverse)
> con = spark_connect(....)
> diamonds_sdl = copy_to(con, diamonds)
> diamonds_sdl
# Source: table<diamonds> [?? x 10]
# Database: spark_connection
carat cut color clarity depth table price x y z
<dbl> <chr> <chr> <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.230 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
2 0.210 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
3 0.230 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
4 0.290 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
5 0.310 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
6 0.240 Very Good J VVS2 62.8 57.0 336 3.94 3.96 2.48
7 0.240 Very Good I VVS1 62.3 57.0 336 3.95 3.98 2.47
8 0.260 Very Good H SI1 61.9 55.0 337 4.07 4.11 2.53
9 0.220 Fair E VS2 65.1 61.0 337 3.87 3.78 2.49
10 0.230 Very Good H VS1 59.4 61.0 338 4.00 4.05 2.39
This is a task that I would do in many ways in normal R. However none works in sparklyr.
For example:
> diamonds_sdl %>% group_by(color) %>% summarise(n=n(), n_expensive=sum(price>400))
> diamonds_sdl %>% group_by(color) %>% summarise(n=n(), n_expensive=length(price[price>400]))
This works with a traditional data frame:
# A tibble: 7 x 3
color n n_expensive
<ord> <int> <int>
1 D 6775 6756
2 E 9797 9758
3 F 9542 9517
4 G 11292 11257
5 H 8304 8274
6 I 5422 5379
7 J 2808 2748
But not in spark:
diamonds_sdl %>% group_by(color) %>% summarise(n=n(), n_expensive=sum(price>400))
Error: org.apache.spark.sql.AnalysisException: cannot resolve 'sum((CAST(diamonds.`price` AS BIGINT) > 400L))' due to data type mismatch: function sum requires numeric types, not BooleanType; line 1 pos 33;
Error in eval_bare(call, env) : object 'price' not found
You have to think in terms of SQL expressions here, for example with if_else:
diamonds_sdl %>% group_by(color) %>%
summarise(n=n(), n_expensive=sum(if_else(price > 400, 1, 0)))
sum with cast:
diamonds_sdl %>% group_by(color) %>%
summarise(n=n(), n_expensive=sum(as.numeric(price > 400)))
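A quick local check, not a Spark run: on the in-memory diamonds data frame, the if_else rewrite counts exactly the same rows as summing the logical directly. This only illustrates that the two expressions are equivalent; it assumes ggplot2 is installed for the dataset and says nothing about how Spark translates the query.

```r
library(dplyr)
library(ggplot2)  # only needed for the diamonds dataset

local_check <- diamonds %>%
  group_by(color) %>%
  summarise(
    n         = n(),
    n_logical = sum(price > 400),                   # fine on a local data frame
    n_if_else = sum(if_else(price > 400, 1L, 0L))   # the form suggested for Spark
  )

local_check
```

The columns n_logical and n_if_else agree for every color, so the rewrite does not change the result.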
There could be a conflict with the type. Converting the logical to integer solves the problem
library(sparklyr)
library(dplyr)
con <- spark_connect(master = "local")
library(ggplot2)
data(diamonds)
diamonds1 = copy_to(con, diamonds)
diamonds1 %>%
group_by(color) %>%
summarise(n=n(), n_expensive = sum(as.integer(price > 400)))
-output