I would like to group the rows of a sparklyr table by a specific column, and count the rows that fulfil a specific criteria.
For example, in the following diamonds table, I would like to group_by color, and count the number of rows with price >400.
> library(sparklyr)
> library(tidyverse)
> con = spark_connect(....)
> diamonds = copy_to(con, diamonds)
> diamonds
# Source: table<diamonds> [?? x 10]
# Database: spark_connection
carat cut color clarity depth table price x y z
<dbl> <chr> <chr> <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.230 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
2 0.210 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
3 0.230 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
4 0.290 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
5 0.310 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
6 0.240 Very Good J VVS2 62.8 57.0 336 3.94 3.96 2.48
7 0.240 Very Good I VVS1 62.3 57.0 336 3.95 3.98 2.47
8 0.260 Very Good H SI1 61.9 55.0 337 4.07 4.11 2.53
9 0.220 Fair E VS2 65.1 61.0 337 3.87 3.78 2.49
10 0.230 Very Good H VS1 59.4 61.0 338 4.00 4.05 2.39
This is a task that I would do in many ways in normal R. However none works in sparklyr.
For example:
> diamonds_sdl %>% group_by(color) %>% summarise(n=n(), n_expensive=sum(price>400))
> diamonds_sdl %>% group_by(color) %>% summarise(n=n(), n_expensive=length(price[price>400]))
This works with a traditional data frame:
# A tibble: 7 x 3
color n n_expensive
<ord> <int> <int>
1 D 6775 6756
2 E 9797 9758
3 F 9542 9517
4 G 11292 11257
5 H 8304 8274
6 I 5422 5379
7 J 2808 2748
But not in spark:
diamonds_sdl %>% group_by(color) %>% summarise(n=n(), n_expensive=sum(price>400))
Error: org.apache.spark.sql.AnalysisException: cannot resolve 'sum((CAST(diamonds.`price` AS BIGINT) > 400L))' due to data type mismatch: function sum requires numeric types, not BooleanType; l
ine 1 pos 33;
Error in eval_bare(call, env) : object 'price' not found
You have to think in terms of SQL expressions here so for example if_else:
diamonds_sdl %>% group_by(color) %>%
summarise(n=n(), n_expensive=sum(if_else(price > 400, 1, 0)))
sum with cast:
diamonds_sdl %>% group_by(color) %>%
summarise(n=n(), n_expensive=sum(as.numeric(price > 400)))
There could be a conflict with the type. Converting the logical to integer solves the problem
library(sparklyr)
library(dplyr)
con <- spark_connect(master = "local")
library(ggplot2)
data(diamonds)
diamonds1 = copy_to(con, diamonds)
diamonds1 %>%
group_by(color) %>%
summarise(n=n(), n_expensive = sum(as.integer(price > 400)))
-output
Related
I have been trying to use dplyr in R to filter a large data frame that has some empty (NA) cells in it. The string I want to use is a vector containing several alphanumeric search terms.
My goal is to create a new data frame or tibble of the rows that contain ANY of the strings in the vector in ANY of the columns of the data frame.
I have tried several things with a data frame I cannot share, but I found an answer in another question that almost does what I need, except for using a vector as the search term.
From Filter rows which contain a certain string :
Filtering for rows where any column fulfils a condition
ggplot2::diamonds %>%
filter(if_any(everything(), ~ grepl('V',.))) %>%
head()
#> # A tibble: 6 × 10
#> carat cut color clarity depth table price x y z
#> <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
#> 2 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
#> 3 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
#> 4 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
#> 5 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
#> 6 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
Instead of V as the search term, what if I wanted to filter for a match to ANY value in a vector?
vector1 <- c("V", "F", "G", "E")
Some things I tried on my own data frame that worked for one value but not when using the vector as a search term:
dfdiamonds <- as.dataframe (ggplot2::diamonds)
`your text`test1 <- dfdiamonds %>%
rowwise() %>%
filter(any(c_across(cols=everything()) %in% c(vector1)
test2<- for(item in vector1) {
dfdiamonds %>%
rowwise() %>%
filter(any(c_across(cols=2) == item))
}
test3 <- filter(dfdiamonds, any(c_across(cols = everything()) %in% c(vector1))
#I tried grep for this one and it gave a result as a value rather than a data frame
matches <- unique (grep(paste(vector1,collapse="|"),
dfdiamonds, value=TRUE))
Anyway, I'm at a loss. Any solution will do!
Here is what you need:
ggplot2::diamonds %>%
filter(if_any(everything(), ~ grepl(paste0(vector1, collapse = "|"),.))) %>%
head()
The simplest solution in this case is probably:
library(tidyverse)
vector1 <- c("V", "F", "G", "E")
diamonds %>%
filter(if_any(everything(), ~ grepl(paste(vector1, collapse = "|"),.))) %>%
head()
#> # A tibble: 6 x 10
#> carat cut color clarity depth table price x y z
#> <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
#> 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
#> 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
#> 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
#> 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
#> 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Created on 2023-01-24 with reprex v2.0.2
library(dplyr)
I am trying to write a simple function that outputs several dataframes. The issue I am having that I want to assign the name of the dataframes in a function.
get_data <- function(diamonds,cut, clarity) {
data_cut <- diamonds %>% filter(cut == cut)
return(assign(paste0( "data_", "cut"),data_cut))
data_cut_clarity <- data_cut %>% filter(clarity == clarity)
return(assign(paste0("data_", "cut", "_", clarity),data_cut_clarity))
}
So what I am trying to achieve, I want to run the following command:
get_data(diamonds,cut="Ideal", clarity="SI2")
so that output will be two dataframes: data_Ideal and data_Ideal_SI2
When I run a function, the return() does not work. I also tired using {{var}} notation to assign a name, but it did not work
It may be better not to use assign and create multiple objects in the global environment. Instead, create a single named list object and extract the elements with either $ or [[
library(dplyr)
library(ggplot2)
get_data <- function(diamonds,cut, clarity) {
data_cut <- diamonds %>%
filter(cut == cut)
data_cut_clarity <- data_cut %>%
filter(clarity == clarity)
dplyr::lst(!!stringr::str_c('data_', cut) := data_cut,
!!stringr::str_c('data_', clarity) := data_cut_clarity)
}
-testing
> out_lst <- get_data(diamonds, cut="Ideal", clarity="SI2")
> names(out_lst)
[1] "data_Ideal" "data_SI2"
> head(out_lst$data_Ideal)
# A tibble: 6 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
> head(out_lst$data_SI2)
# A tibble: 6 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
According to ?return
If the end of a function is reached without calling return, the value of the last evaluated expression is returned.
i.e. we don't need to be explicit with return unless it should be from an earlier step. In the above function, the last expression will be returned
In addition to := in naming on the lhs, we may also use list with setNames
get_data <- function(diamonds,cut, clarity) {
data_cut <- diamonds %>%
filter(cut == cut)
data_cut_clarity <- data_cut %>%
filter(clarity == clarity)
nm1 <- stringr::str_c('data_', c(cut, clarity))
setNames(list(data_cut, data_cut_clarity), nm1)
}
Using the diamonds dataset (from the ggplot2 library) as an example, I am trying to subset this table by columns and rows based on a vector of named elements (the names of the vector should be used to subset by columns and the corresponding vector elements by rows).
library(ggplot2)
diamonds
# A tibble: 53,940 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# … with 53,930 more rows
myVector <- c(cut="Ideal", cut="Good", color="E", color="J")
myVector
cut cut color color
"Ideal" "Good" "E" "J"
What I intend to do, would be something like follows but using myVector:
library(dplyr)
diamonds %>% subset(., (cut=="Ideal" | cut=="Good") & (color=="E" | color=="J")) %>%
select(cut, color)
Starting with the split idea of ThomasIsCoding, slightly changed, here is a base R solution based on having Reduce/Map created a logical index.
v <- split(unname(myVector), names(myVector))
i <- Reduce('&', Map(function(x, y){x %in% y}, diamonds[names(v)], v))
diamonds[i, ]
## A tibble: 6,039 x 10
# carat cut color clarity depth table price x y z
# <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
# 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
# 2 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
# 3 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
# 4 0.3 Good J SI1 64 55 339 4.25 4.28 2.73
# 5 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46
# 6 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71
# 7 0.3 Good J SI1 63.4 54 351 4.23 4.29 2.7
# 8 0.3 Good J SI1 63.8 56 351 4.23 4.26 2.71
# 9 0.23 Good E VS1 64.1 59 402 3.83 3.85 2.46
#10 0.33 Ideal J SI1 61.1 56 403 4.49 4.55 2.76
## ... with 6,029 more rows
Package dplyr
The code above can be written as a function and used in dplyr::filter.
# Input:
# X - a data set to be filtered
# values - a named list
values_in <- function(X, values){
v <- split(unname(values), names(values))
i <- Reduce('&', Map(function(x, y){x %in% y}, X[names(v)], v))
i
}
diamonds %>% filter( values_in(., myVector) )
The output is the same as above and, therefore, omited.
I am not sure if you want something like below
u <- split(myVector,names(myVector))
eval(str2expression(sprintf("diamonds %%>%% filter(%s)",paste0(sapply(names(u),function(x) paste0(x," %in% u$",x)),collapse = " & "))))
such that
> eval(str2expression(sprintf("diamonds %%>%% filter(%s)",paste0(sapply(names(u),function(x) paste0(x," %in% u$",x)),collapse = " & "))))
# A tibble: 6,039 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
3 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
4 0.3 Good J SI1 64 55 339 4.25 4.28 2.73
5 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46
6 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71
7 0.3 Good J SI1 63.4 54 351 4.23 4.29 2.7
8 0.3 Good J SI1 63.8 56 351 4.23 4.26 2.71
9 0.23 Good E VS1 64.1 59 402 3.83 3.85 2.46
10 0.33 Ideal J SI1 61.1 56 403 4.49 4.55 2.76
# ... with 6,029 more rows
Using both approaches proposed by #Roman (generating all combinations of vector element and joining) and #ThomaslsCoding (splitting the vector) seems to do the trick:
data.frame(split(myVector, names(myVector))) %>%
expand.grid() %>%
inner_join(diamonds[,unique(names(myVector))])
you can try
my_vec_cut = myVector[names(myVector) == "cut"]
my_vec_color = myVector[names(myVector) == "color"]
I splitted the vector in two since you filter for two columns using and and or
diamonds %>%
filter(.data[[unique(names(my_vec_cut))]] %in% my_vec_cut & .data[[unique(names(my_vec_color))]] %in% my_vec_color)
A general way would be a joining approach. First you build all required combinations from your vector, then you left join the data.
library(tidyverse)
tibble(a=names(myVector), b=myVector) %>%
group_by(a) %>%
mutate(n=1:n()) %>%
pivot_wider(names_from = a, values_from=b) %>%
select(-n) %>%
complete(cut, color)
# A tibble: 4 x 2
cut color
<chr> <chr>
1 Good E
2 Good J
3 Ideal E
4 Ideal J
# now left_joining:
tibble(a=names(myVector), b=myVector) %>%
group_by(a) %>%
mutate(n=1:n()) %>%
pivot_wider(names_from = a, values_from=b) %>%
select(-n) %>%
complete(cut, color) %>%
left_join(diamonds)
count(cut, color)
Similar idea to #ThomasIsCoding's, just in base R.
al <- split(myVector, names(myVector))
res <- with(diamonds, diamonds[eval(parse(text=paste(sapply(names(al), function(x)
paste0(x, " %in% ", "al[['", x, "']]")), collapse=" & "))), ])
unique(res$cut)
# [1] Ideal Good
# Levels: Fair < Good < Very Good < Premium < Ideal
unique(res$color)
# [1] E J
# Levels: D < E < F < G < H < I < J
If you don't use the vector which has characters (and not expressions) as names, it gets a lot easier and maybe more readable:
library(ggplot2)
library(tidyverse)
library(rlang)
my_filter <- function(d, x, selection) {
cmd <- map2(x, selection, ~quo(`%in%`(!!.x, !!.y))) # create filter expression
d %>%
filter(!!!cmd) %>% # filter
select(!!!x) # select columns cut and color (in this case)
}
diamonds %>%
my_filter(x = vars(cut, color),
sel = list(c("Ideal", "Good"), c("E", "J")))
# # A tibble: 6,039 x 2
# cut color
# <ord> <ord>
# 1 Ideal E
# 2 Good E
# 3 Good J
# 4 Good J
# 5 Ideal J
# 6 Ideal J
# 7 Good J
# 8 Good J
# 9 Good E
# 10 Ideal J
# # ... with 6,029 more rows
library(ggplot2)
library(dplyr)
diamonds <- diamonds %>% select(-clarity)
# this works fine
# but doing it again gives me an error
diamonds %>% select(-clarity)
Error in is_character(x) : object 'clarity' not found
How do I do a safe drop/deselect?
You can do:
diamonds %>%
select(-one_of("clarity"))
If there is a non-existing variable:
diamonds %>%
select(-one_of("clarity", "clearness"))
it returns a warning:
Warning message:
Unknown columns: `clearness`
From dplyr 1.0.0, any_of() could be used:
diamonds %>%
select(-any_of(c("clarity", "clearness")))
Here's a slight twist using dplyr::select_if() that will not throw an Unknown columns: warning if the column name does not exist, in this case 'bad_column':
diamonds %>%
select_if(!names(.) %in% c('carat', 'cut', 'bad_column'))
Here's a simple modification to the one_of method shown by tmfmnk to work with symbols like select. The input is converted to quosures then to character.
library(tidyverse) # or just dplyr and purrr
drop_cols <- function(df, ...){
df %>%
select(-one_of(map_chr(enquos(...), quo_name)))
}
diamonds %>%
drop_cols(clarity, color, zebra)
# # A tibble: 53,940 x 8
# carat cut depth table price x y z
# <dbl> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
# 1 0.23 Ideal 61.5 55 326 3.95 3.98 2.43
# 2 0.21 Premium 59.8 61 326 3.89 3.84 2.31
# 3 0.23 Good 56.9 65 327 4.05 4.07 2.31
# 4 0.290 Premium 62.4 58 334 4.2 4.23 2.63
# 5 0.31 Good 63.3 58 335 4.34 4.35 2.75
# 6 0.24 Very Good 62.8 57 336 3.94 3.96 2.48
# 7 0.24 Very Good 62.3 57 336 3.95 3.98 2.47
# 8 0.26 Very Good 61.9 55 337 4.07 4.11 2.53
# 9 0.22 Fair 65.1 61 337 3.87 3.78 2.49
# 10 0.23 Very Good 59.4 61 338 4 4.05 2.39
# # ... with 53,930 more rows
# Warning message:
# Unknown columns: `zebra`
I have dozens of variables that I need to operate on by group, with different instructions to be done depending on the variable, usually as per the name of the variable, with a few ad hoc changes and renaming here and there.
A reprex using a modified diamonds dataset for illustration is below:
library(tidyverse)
diamond_renamed <- diamonds %>%
rename(size_x = x, size_y = y, size_z = z) %>%
rename(val_1 = depth, val_2 = table)
diamond_summary <- bind_cols(diamond_renamed %>%
group_by(cut, color, clarity) %>%
summarise(
cost = sum(price)
),
diamond_renamed %>%
group_by(cut, color, clarity) %>%
summarise_at(
vars(contains("size")),
funs(median(.))
),
diamond_renamed %>%
group_by(cut, color, clarity) %>%
summarise_at(
vars(contains("val")),
funs(mean(.))
)
)
diamond_summary
#> # A tibble: 276 x 15
#> # Groups: cut, color [?]
#> cut color clarity cost cut1 color1 clarity1 size_x size_y size_z
#> <ord> <ord> <ord> <int> <ord> <ord> <ord> <dbl> <dbl> <dbl>
#> 1 Fair D I1 29532 Fair D I1 7.32 7.20 4.70
#> 2 Fair D SI2 243888 Fair D SI2 6.13 6.06 3.99
#> 3 Fair D SI1 247854 Fair D SI1 6.08 6.04 3.93
#> 4 Fair D VS2 112822 Fair D VS2 6.04 6 3.65
#> 5 Fair D VS1 14606 Fair D VS1 5.56 5.58 3.66
#> 6 Fair D VVS2 32463 Fair D VVS2 4.95 4.84 3.31
#> 7 Fair D VVS1 13419 Fair D VVS1 4.92 5.03 3.28
#> 8 Fair D IF 4859 Fair D IF 4.68 4.73 2.88
#> 9 Fair E I1 18857 Fair E I1 6.18 6.14 4.03
#> 10 Fair E SI2 325446 Fair E SI2 6.28 6.20 3.95
#> # ... with 266 more rows, and 5 more variables: cut2 <ord>, color2 <ord>,
#> # clarity2 <ord>, val_1 <dbl>, val_2 <dbl>
This yields the desired result: a dataset with the grouped summaries... but it also repeats the grouped variables. It's also not great to have to repeat the group_by code itself everytime... but I'm not sure how else to do it. It may also not be the most efficient use of summarise. How can we avoid that repetition, make this code better?
Thank you!
One option would be to mutate instead of summarize in the initial steps and add those columns in the group_by
diamond_renamed %>%
group_by(cut, color, clarity) %>%
group_by(cost = sum(price), add = TRUE) %>%
mutate_at(vars(contains("size")), median) %>%
group_by_at(vars(contains("size")), .add = TRUE) %>%
summarise_at(vars(contains("val")), mean)
# A tibble: 276 x 9
# Groups: cut, color, clarity, cost, size_x, size_y [?]
# cut color clarity cost size_x size_y size_z val_1 val_2
# <ord> <ord> <ord> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 Fair D I1 29532 7.32 7.20 4.70 65.6 56.8
# 2 Fair D SI2 243888 6.13 6.06 3.99 64.7 58.6
# 3 Fair D SI1 247854 6.08 6.04 3.93 64.6 58.8
# 4 Fair D VS2 112822 6.04 6 3.65 62.7 60.3
# 5 Fair D VS1 14606 5.56 5.58 3.66 63.2 57.8
# 6 Fair D VVS2 32463 4.95 4.84 3.31 61.7 58.8
# 7 Fair D VVS1 13419 4.92 5.03 3.28 61.7 64.3
# 8 Fair D IF 4859 4.68 4.73 2.88 60.8 58
# 9 Fair E I1 18857 6.18 6.14 4.03 65.6 58.1
#10 Fair E SI2 325446 6.28 6.20 3.95 63.4 59.5
# ... with 266 more rows
NOTE: The grouping columns 'cut', 'color', 'clarity' are not repeated here as in the OP's post. So, it is only 9 columns instead of 15