group by and concatenate string using sparklyr

There are a number of questions asking precisely the same thing but none within the context of a sparklyr environment. How does one group by a column and then concatenate the values of some other column as a list?
For example the following results in the desired output in a local R environment.
mtcars %>%
distinct(gear, cyl) %>%
group_by(gear) %>%
summarize(test_list = paste0(cyl, collapse = ";")) %>%
select(gear, test_list) %>%
as.data.frame() %>%
print()
gear test_list
1 3 6;8;4
2 4 6;4
3 5 4;8;6
But registering that same table in Spark and running the same code fails with a SQL parsing error on the summarize step (see code below), probably because the collapse argument is handed to Spark SQL rather than handled by R's paste0. I know PySpark and Spark SQL have a collect_set() function that achieves the desired effect. Is there something analogous for sparklyr?
sdf_copy_to(sc, x = mtcars, name = "mtcars_test")
tbl(sc, "mtcars_test") %>%
distinct(gear, cyl) %>%
group_by(gear) %>%
summarize(test_list = paste0(cyl, collapse = ";"))
Error:
Error : org.apache.spark.sql.catalyst.parser.ParseException:
In PySpark, the following approach is similar (except that the concatenated column is an array, which can then be collapsed).
from pyspark.sql.functions import collect_set
df2 = spark.table("mtcars_test")
df2.groupby("gear").agg(collect_set('cyl')).createOrReplaceTempView("mtcars_test_cont")
display(spark.table("mtcars_test_cont"))
gear collect_set(cyl)
3 [8, 4, 6]
4 [4, 6]
5 [8, 4, 6]

Instead of using R functions, you can use Spark SQL syntax directly by wrapping it in the sql() function from dbplyr. The script below produces the desired output:
sdf_copy_to(sc, x = mtcars, name = "mtcars_test")
tbl(sc, "mtcars_test") %>%
group_by(gear) %>%
summarize(test_list = sql("array_join(collect_set(cast(cyl as int)), ';')"))
#> gear test_list
#> <dbl> <chr>
#> 4 6;4
#> 3 6;4;8
#> 5 6;4;8
I only changed the summarize line of your code, where you used paste0. collect_set already removes duplicates, so the distinct step is no longer needed.
This is one reason why I prefer SparkR to sparklyr: almost all PySpark syntax works in the same manner.
SparkR::agg(
SparkR::group_by(SparkR::createDataFrame(mtcars), SparkR::column("gear")),
test_list = SparkR::array_join(
SparkR::collect_set(SparkR::cast(SparkR::column("cyl"), "integer")),
";"
)
) %>%
SparkR::collect()
#> gear test_list
#> 4 6;4
#> 3 6;4;8
#> 5 6;4;8
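If the array column from collect_set() is brought into R rather than joined inside Spark, it can also be collapsed locally. The snippet below is only a sketch: it assumes the collected array arrives as a list column, and `collected` is a hand-built stand-in, not real Spark output.

```r
# Hand-built stand-in for a collected result
# (assumption: array columns arrive in R as a list column)
collected <- data.frame(gear = c(3, 4, 5))
collected$cyl_set <- list(c(8, 4, 6), c(4, 6), c(8, 4, 6))

# Collapse each element of the list column into one ";"-separated string
collected$test_list <- vapply(
  collected$cyl_set,
  function(v) paste(sort(v), collapse = ";"),
  character(1)
)
print(collected[, c("gear", "test_list")])
```

Joining inside Spark, as in the answers above, avoids moving the arrays to the driver, so this is mainly useful when the collected result is small.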

Related

Reconcile dataset *column types* (formats) using a dictionary/list in R/dplyr

Following on from the renaming request #67453183, I want to do the same for formats using the dictionary, because the renaming solution won't bring together columns of distinct types.
I have a series of data sets and a dictionary to bring these together, but I'm struggling to figure out how to automate this. Suppose this data and dictionary (the actual one is much longer, thus I want to automate):
mtcarsA <- mtcars[1:2,1:3] %>% rename(mpgA = mpg, cyl_A = cyl) %>% as_tibble()
mtcarsB <- mtcars[3:4,1:3] %>% rename(mpg_B = mpg, B_cyl = cyl) %>% as_tibble()
mtcarsB$B_cyl <- as.factor(mtcarsB$B_cyl)
dic <- tibble(true_name = c("mpg_true", "cyl_true"),
nameA = c("mpgA", "cyl_A"),
nameB = c("mpg_B", "B_cyl"),
true_format = c("factor", "numeric")
)
I want these datasets (from years A and B) appended to one another, and then to have the names changed or coalesced to the 'true_name' values.... I want to automate 'coalesce all columns with duplicate names'.
And to bring these together, the types need to be the same too. I'm giving the entire problem here because perhaps someone also has a better solution for 'using a data dictionary'.
@RonakShah in the previous query proposed
pmap(dic, ~setNames(..1, paste0(c(..2, ..3), collapse = '|'))) %>%
flatten_chr() -> val
mtcars_all <- list(mtcarsA,mtcarsB) %>%
map_df(function(x) x %>% rename_with(~str_replace_all(.x, val)))
This works great in the previous example but not if the formats vary. Here it throws an error:
Error: Can't combine ..1$cyl_true <double> and ..2$cyl_true <factor<51fac>>.
This response to #56773354 offers a related solution if one has a complete list of types, but not for a type list by column name, as I have.
Desired output:
mtcars_all
# A tibble: 4 x 3
mpg_true cyl_true disp
<factor> <numeric> <dbl>
1 21 6 160
2 21 6 160
3 22.8 4 108
4 21.4 6 258
Something simpler:
library(magrittr) # %<>% is cool
library(dplyr)
# The renaming is easy:
renameA <- dic$nameA
renameB <- dic$nameB
names(renameA) <- dic$true_name
names(renameB) <- dic$true_name
mtcarsA %<>% rename(all_of(renameA))
mtcarsB %<>% rename(all_of(renameB))
# Formatting is a little harder:
formats <- dic$true_format
names(formats) <- dic$true_name
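# Map each format name in the dictionary to a coercion function
coercers <- list(
  factor = as.factor,
  # as.numeric() on a factor returns level codes, so go through the labels
  numeric = function(v) if (is.factor(v)) as.numeric(as.character(v)) else as.numeric(v)
)
for (x in names(formats)) {
  coercer <- coercers[[formats[[x]]]]
  if (is.null(coercer)) stop("Unrecognized format: ", formats[[x]])
  mtcarsA[[x]] <- coercer(mtcarsA[[x]])
  mtcarsB[[x]] <- coercer(mtcarsB[[x]])
}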
mtcars_all <- bind_rows(mtcarsA, mtcarsB)
In the background, be aware of how base R treated concatenating factors before 4.1.0, and that this behaviour changed in 4.1.0. Here it probably doesn't matter because bind_rows uses the vctrs package.
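One thing worth checking when the dictionary asks for numeric and a column is already a factor (as B_cyl is in the question): as.numeric() on a factor returns the integer level codes, not the labels. A small base R illustration, using a made-up factor `f`:

```r
f <- factor(c("4", "6", "6"))

codes  <- as.numeric(f)                 # level codes: 1 2 2
values <- as.numeric(as.character(f))   # labels parsed as numbers: 4 6 6

print(codes)
print(values)
```

Converting through as.character() first is the usual way to recover the original numbers.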
I took a different approach from Ronak's to read the dictionary. It is more verbose but I find it a bit more readable. A benchmark would be interesting to see which one is faster ;-)
Unfortunately, it seems that as() cannot blindly cast a variable to a factor, so I switched to character instead. In practice it should behave much like a factor, and you can call as_factor() on the final object if this is very important to you. Another possibility would be to store a casting function name (such as as_factor) in the dictionary, retrieve it using get() and use it instead of as().
library(tidyverse)
mtcarsA <- mtcars[1:2,1:3] %>% rename(mpgA = mpg, cyl_A = cyl) %>% as_tibble()
mtcarsB <- mtcars[3:4,1:3] %>% rename(mpg_B = mpg, B_cyl = cyl) %>% as_tibble()
mtcarsB$B_cyl <- as.factor(mtcarsB$B_cyl)
dic <- tibble(true_name = c("mpg_true", "cyl_true"),
nameA = c("mpgA", "cyl_A"),
nameB = c("mpg_B", "B_cyl"),
true_format = c("numeric", "character") #instead of factor
)
dic2 = dic %>%
pivot_longer(-c(true_name, true_format), names_to=NULL)
read_dic = function(key, dict=dic2){
x = dict[dict$value==key,][["true_name"]]
if(length(x)!=1) x=key
x
}
rename_from_dic = function(df, dict=dic2){
rename_with(df, ~{
map_chr(.x, ~read_dic(.x, dict))
})
}
cast_from_dic = function(df, dict=dic){
mutate(df, across(everything(), ~{
cl=dict[dict$true_name==cur_column(),][["true_format"]]
if(length(cl)!=1) cl=class(.x)
as(.x, cl, strict=FALSE)
}))
}
list(mtcarsA,mtcarsB) %>%
map(rename_from_dic) %>%
map_df(cast_from_dic)
#> # A tibble: 4 x 3
#> mpg_true cyl_true disp
#> <dbl> <chr> <dbl>
#> 1 21 6 160
#> 2 21 6 160
#> 3 22.8 4 108
#> 4 21.4 6 258
Created on 2021-05-09 by the reprex package (v2.0.0)
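The alternative mentioned in this answer, storing the name of a casting function in the dictionary, could look roughly like the sketch below. The `cast_fun` column and the small `df` are made up for illustration, and match.fun() is used here in place of get() because it checks that the name resolves to a function.

```r
# Hypothetical dictionary with a cast_fun column holding function names
dic_fun <- data.frame(
  true_name = c("mpg_true", "cyl_true"),
  cast_fun  = c("as.numeric", "as.factor"),
  stringsAsFactors = FALSE
)

# Made-up data that has already been renamed to the true names
df <- data.frame(
  mpg_true = c("21", "22.8"),
  cyl_true = c(6, 4),
  disp     = c(160, 108),
  stringsAsFactors = FALSE
)

for (i in seq_len(nrow(dic_fun))) {
  col <- dic_fun$true_name[i]
  if (col %in% names(df)) {
    cast <- match.fun(dic_fun$cast_fun[i])  # look the function up by name
    df[[col]] <- cast(df[[col]])
  }
}
str(df)
```

Columns that do not appear in the dictionary (disp here) are left untouched.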

How can I dynamically build a string and pass it to dplyr's mutate() function in R?

I asked a similar question before (Link). The given answer works fine. However, it turns out, that it does not fully apply to my use case.
Please consider the following minimal working example:
library(RSQLite)
library(dplyr)
library(dbplyr)
library(DBI)
library(stringr)
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
copy_to(con, mtcars, "mtcars", temporary = FALSE)
db <- tbl(con, "mtcars") %>%
select(carb) %>%
distinct(carb) %>%
arrange(carb) %>%
mutate(Q1=1, Q2=2, Q3=3, Q4=4) %>%
collect()
I am interested in dynamically building the string Q1=1, Q2=2, Q3=3, Q4=4 such that it could be Q1=1, Q2=2, ..., Qn = n.
One idea I had is to build the string like that:
n_par <- 4
str_c('Q', 1:n_par, ' = ', 1:n_par, collapse = ', ')
such that n_par could be any positive integer. However, due to dplyr's non-standard evaluation, I cannot pass this string to mutate(), which is exactly what I need.
Can somebody help?
Generating and evaluating the string
Q1 = 1, Q2 = 2, Q3 = 3, Q4 = 4 is not a string in the same way that "Q1 = 1, Q2 = 2, Q3 = 3, Q4 = 4" is a string. There are some R functions that will take a string object and evaluate it as code. For example:
> eval(parse(text="print('hello world')"))
#> [1] "hello world"
However, this may not play nicely inside dbplyr translation. If you manage to get something like this approach working it would be good to see it posted as an answer.
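For what it is worth, a tidy-eval variant of the parse-and-evaluate idea is sketched below. It runs on a local data frame only and has not been checked against a dbplyr backend; `q_strings` and `q_exprs` are names invented for the example.

```r
library(dplyr)

n_par <- 4

# One string per new column; each is parsed into an unevaluated expression
q_strings <- as.character(seq_len(n_par))
q_exprs <- lapply(q_strings, rlang::parse_expr)
names(q_exprs) <- paste0("Q", seq_len(n_par))

# Splice the named expressions into a single mutate() call
out <- mtcars %>%
  distinct(carb) %>%
  arrange(carb) %>%
  mutate(!!!q_exprs)
print(out)
```

Because the strings are parsed into expressions before mutate() sees them, no eval(parse()) call is needed inside the pipeline.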
Using a loop
Instead of doing it as a single string, an alternative is to use a loop:
db <- tbl(con, "mtcars") %>%
select(carb) %>%
distinct(carb) %>%
arrange(carb)
n <- 4 # number of columns to add (n_par in the question)
for (i in seq_len(n)) {
  var <- paste0("Q", i)
  db <- db %>%
    mutate(!!sym(var) := !!i)
}
db <- collect(db)
The !!sym() is required to tell dplyr that the string stored in var should be treated as a column name. Without it, the new column would simply be named var. The := operator is required because R does not allow an unquoted expression on the left-hand side of =.
This approach is roughly equivalent to one mutate statement for each variable (example below), but the dbplyr translation might not look as elegant as doing it all within a single mutate statement.
db <- tbl(con, "mtcars") %>%
select(carb) %>%
distinct(carb) %>%
arrange(carb) %>%
mutate(Q1 = 1) %>%
mutate(Q2 = 2) %>%
...
mutate(Qn = n) %>%
collect()
I recently read more about the topic and found that the following code works quite nicely and lets dbplyr generate cleaner SQL.
# Libraries
library(RSQLite)
library(dplyr)
library(dbplyr)
library(DBI)
# Example database
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
copy_to(con, mtcars, "mtcars", temporary = FALSE)
# Parameter for number of variables to be created
n <- 4
# Variable list
var <- list()
for(i in 1:n){
j <- paste0("Q", i)
var[[j]] <- i
}
# Query/computation
db <- tbl(con, "mtcars") %>%
select(carb) %>%
distinct(carb) %>%
arrange(carb) %>%
mutate(!!! var) %>%
show_query() %>%
collect()
The trick was to build a list with proper names and to put it into the mutate() function using !!!. Furthermore, I read that parsing and evaluating strings should be avoided, so I switched to lists.
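The named list can also be built without a loop. A minimal sketch on a local data frame (so no database is needed to try it), assuming the same Q1..Qn naming:

```r
library(dplyr)

n <- 4

# Named list Q1 = 1, ..., Qn = n, built in one step
vars <- setNames(as.list(seq_len(n)), paste0("Q", seq_len(n)))

out <- mtcars %>%
  distinct(carb) %>%
  arrange(carb) %>%
  mutate(!!!vars)
print(out)
```

Splicing with !!! passes each list element to mutate() as if it had been typed as a separate named argument.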
Does this work in your database?
library(tidyverse)
q_n <- function(n) {
str_c('Q', 1:n, ' = ', 1:n, collapse = ', ')
}
create_n_string <- function(data,n = 5,string = "Q"){
data %>%
mutate(new_col = str_flatten(1:n,collapse = "_")) %>%
separate(new_col,into = string %>% str_c(1:n),sep = "_")
}
mtcars %>%
select(carb) %>%
distinct(carb) %>%
arrange(carb) %>%
create_n_string()
#> carb Q1 Q2 Q3 Q4 Q5
#> 1 1 1 2 3 4 5
#> 2 2 1 2 3 4 5
#> 3 3 1 2 3 4 5
#> 4 4 1 2 3 4 5
#> 5 6 1 2 3 4 5
#> 6 8 1 2 3 4 5
Created on 2020-01-22 by the reprex package (v0.3.0)

Sparklyr : force allocation to use functions such as n_distinct, match [duplicate]

I'm trying to count the number of unique elements in each column of the Spark dataset s.
However, it seems that Spark doesn't recognize tally():
k<-collect(s%>%group_by(grouping_type)%>%summarise_each(funs(tally(distinct(.)))))
Error: org.apache.spark.sql.AnalysisException: undefined function TALLY
It seems that Spark doesn't recognize simple R functions either, like "unique" or "length". I can run the code on local data, but when I try to run the exact same code on a Spark table it doesn't work.
```
d<-data.frame(cbind(seq(1,10,1),rep(1,10)))
d$group<-rep(c("a","b"),each=5)
d%>%group_by(group)%>%summarise_each(funs(length(unique(.))))
A tibble: 2 × 3
group X1 X2
<chr> <int> <int>
1 a 5 1
2 b 5 1
k<-collect(s%>%group_by(grouping_type)%>%summarise_each(funs(length(unique(.)))))
Error: org.apache.spark.sql.AnalysisException: undefined function UNIQUE;
```
library(sparklyr)
library(dplyr)
#I am on Spark V. 2.1
#Building example input (local)
d <- data.frame(cbind(seq(1, 10, 1), rep(1,10)))
d$group <- rep(c("a","b"), each = 5)
d
#Spark tbl
sdf <- sparklyr::sdf_copy_to(sc, d)
# The Answer
sdf %>%
group_by(group) %>%
summarise_all(funs(n_distinct)) %>%
collect()
#Output
group X1 X2
<chr> <dbl> <dbl>
1 b 5 1
2 a 5 1
NB: Given that we are using sparklyr I went for dplyr::n_distinct().
Minor: dplyr::summarise_each is deprecated, hence dplyr::summarise_all.
Remember that when you are writing sparklyr you are really transpiling to Spark SQL, so you may need to use Spark SQL functions from time to time. This is one of those times where Spark SQL functions like count and distinct come in handy.
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local") # assumes a local Spark installation
iris_spk <- copy_to(sc, iris)
# for instance this does not work in R, but it does in sparklyr
iris_spk %>%
summarise(Species = distinct(Species))
# or
iris_spk %>%
summarise(Species = approx_count_distinct(Species))
# this does what you are looking for
iris_spk %>%
group_by(Species) %>%
summarise_all(funs(n_distinct))
# for larger data sets this is much faster
iris_spk %>%
group_by(Species) %>%
summarise_all(funs(approx_count_distinct))
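To see the transpiling without a cluster, dbplyr can show the SQL it would generate. The sketch below uses dbplyr's simulated generic connection rather than a real Spark connection, so the exact SQL that sparklyr produces may differ; it is only meant to show which R calls are translated and which are passed through untouched.

```r
library(dbplyr)

con <- simulate_dbi()  # simulated generic connection; no Spark involved

# A function dbplyr knows about is translated to SQL
known <- translate_sql(n_distinct(Species), con = con, window = FALSE)

# Functions it does not know are passed through for the database to interpret
passed  <- translate_sql(approx_count_distinct(Species), con = con, window = FALSE)
unknown <- translate_sql(length(unique(Species)), con = con, window = FALSE)

print(known)
print(passed)
print(unknown)
```

A passed-through name only works if the database has a function of that name, which is why approx_count_distinct runs on Spark while unique and length produce the "undefined function" errors in the question.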

Using Dplyr In A Function To Create New Dataframes

I'm trying to create new dataframes from dplyr 0.4.3 functions using R 3.2.2.
What I want to do is create some new dataframes using dplyr::filter to separate out data from one ginormous dataframe into a bunch of smaller dataframes.
For my bog-simple reproducible example, I used this:
filter(mtcars, cyl == 4)
I know I need to assign that to a dataframe of its own, so I started with:
paste("Cylinders:", x, sep = "") <- filter(mtcars, cyl == 4)
That didn't work -- it gave me the error found here: Assignment Expands to Non-Language Object
From there, I found this: Create A Variable Name with Paste in R
(also, big ups to the authors of the above)
And that led me to this, which works:
assign(paste("gears_cars_cylinders", 4, sep = "_"), filter(mtcars, cyl == 4)) %>%
group_by(gear) %>%
summarise(number_of_cars = n())
and by "works," I mean I get a dataframe named gears_cars_cylinders_4 with all the goodies from
filter(mtcars, cyl == 4) %>%
group_by(gear) %>%
summarise(number_of_cars = n())
But ultimately, I think I need to wrap this whole thing in a function and be able to feed it the cylinder numbers from mtcars$cyl. I'm thinking something like plyr::ldply(mtcars$cyl, function_name)?
In my real-life data, I have about 70 different classes I need to split out into separate dataframes to drop into DT::datatable tabs in Shiny, which is a whole nuther mess. Anyway.
When I try this:
function_name <- function(x){
assign(paste("gears_cars_cylinders", x, sep = "_"), filter(mtcars, cyl == x)) %>%
group_by(gear) %>%
summarise(number_of_cars = n())
}
and then function_name(6),
I get the output of the dataframe to the screen, but not a dataframe with the name.
Am I looking right over the answer here?
You need to assign the new data frames into the environment from which you're calling function_name(). Try something like this:
library(dplyr)
foo <- function(x) {
assign(paste("gears_cars_cylinders", x, sep = "_"),
envir = parent.frame(),
value = mtcars %>%
filter(cyl == x) %>%
count(gear))
}
for(cyl in sort(unique(mtcars$cyl))) foo(cyl)
ls()
#> [1] "cyl" "foo"
#> [3] "gears_cars_cylinders_4" "gears_cars_cylinders_6"
#> [5] "gears_cars_cylinders_8"
gears_cars_cylinders_4
#> Source: local data frame [3 x 2]
#>
#> gear n
#> (dbl) (int)
#> 1 3 1
#> 2 4 8
#> 3 5 2
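As an alternative to creating one global variable per class, the same tables can be kept together in a named list, which may be easier to loop over when building many tabs. This is a sketch of a different technique from the assign() approach above, and `by_cyl` is a name chosen for the example.

```r
library(dplyr)

# One summary table per cylinder count, kept together in a named list
by_cyl <- lapply(split(mtcars, mtcars$cyl), function(d) count(d, gear))
names(by_cyl) <- paste("gears_cars_cylinders", names(by_cyl), sep = "_")

print(names(by_cyl))
print(by_cyl$gears_cars_cylinders_4)
```

Each element can then be retrieved by name or processed in turn with lapply().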
