Count the number of unique elements in each column with dplyr in sparklyr - R

I'm trying to count the number of unique elements in each column of the Spark dataset `s`.
However, it seems that Spark doesn't recognize `tally()`:
```
k <- collect(s %>% group_by(grouping_type) %>% summarise_each(funs(tally(distinct(.)))))
Error: org.apache.spark.sql.AnalysisException: undefined function TALLY
```
It seems that Spark doesn't recognize simple R functions either, like `unique` or `length`. I can run the code on local data, but when I try to run the exact same code on the Spark table it doesn't work:
```
d <- data.frame(cbind(seq(1, 10, 1), rep(1, 10)))
d$group <- rep(c("a", "b"), each = 5)
d %>% group_by(group) %>% summarise_each(funs(length(unique(.))))
# A tibble: 2 × 3
#   group    X1    X2
#   <chr> <int> <int>
# 1 a         5     1
# 2 b         5     1
k <- collect(s %>% group_by(grouping_type) %>% summarise_each(funs(length(unique(.)))))
Error: org.apache.spark.sql.AnalysisException: undefined function UNIQUE;
```

```
library(sparklyr)
library(dplyr)
# I am on Spark v2.1

# Building example input (local)
d <- data.frame(cbind(seq(1, 10, 1), rep(1, 10)))
d$group <- rep(c("a", "b"), each = 5)
d

# Spark tbl
sdf <- sparklyr::sdf_copy_to(sc, d)

# The answer
sdf %>%
  group_by(group) %>%
  summarise_all(funs(n_distinct)) %>%
  collect()

# Output
# # A tibble: 2 × 3
#   group    X1    X2
#   <chr> <dbl> <dbl>
# 1 b         5     1
# 2 a         5     1
```
NB: Given that we are using sparklyr, I went for dplyr::n_distinct().
Minor: dplyr::summarise_each is deprecated, hence dplyr::summarise_all.
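For what it's worth, on current dplyr (1.0+) funs() and the summarise_all() family are themselves superseded by across(). A minimal sketch of the same aggregation in that style, assuming a dbplyr backend recent enough to translate across():
```
# Same aggregation with the across() idiom (dplyr >= 1.0);
# n_distinct() translates to COUNT(DISTINCT ...) in Spark SQL.
sdf %>%
  group_by(group) %>%
  summarise(across(everything(), n_distinct)) %>%
  collect()
```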

Remember that when you are writing sparklyr code you are really transpiling to Spark SQL, so you may need to use Spark SQL functions from time to time. This is one of those times, where Spark SQL functions like count and distinct come in handy.
```
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
iris_spk <- copy_to(sc, iris)

# For instance, this does not work in R, but it does in sparklyr
iris_spk %>%
  summarise(Species = distinct(Species))

# or
iris_spk %>%
  summarise(Species = approx_count_distinct(Species))

# This does what you are looking for
iris_spk %>%
  group_by(Species) %>%
  summarise_all(funs(n_distinct))

# For larger data sets this is much faster
iris_spk %>%
  group_by(Species) %>%
  summarise_all(funs(approx_count_distinct))
```
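Since the pipeline is transpiled to Spark SQL, you can inspect the generated query with dplyr::show_query() rather than guessing at the translation; a quick illustration (not run here) using the iris_spk tbl from above:
```
# Print the Spark SQL generated for the pipeline instead of executing it.
iris_spk %>%
  group_by(Species) %>%
  summarise_all(funs(n_distinct)) %>%
  show_query()
```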

Related

dplyr works differently for dataframe and database - how to resolve?

Apparently, dplyr requires different implementations for data frames and databases.
This is not the first time I have encountered this.
The example code is below. The purpose of the code is to remove Inf values from the database.
```
library(RSQLite)
library(DBI)
library(dplyr)

# dataframe
data <- data.frame(x = c(rep(1, 2), rep(Inf, 3), rep(1, 5)),
                   y = c(rep(2, 5), rep(Inf, 5)),
                   z = 1:10)

# database
db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(db, data, name = "data", overwrite = TRUE)
data_db <- tbl(db, "data")

# WORKS for dataframe:
data %>% filter_at(c("x", "y"), all_vars(base::is.finite(.)))

# DOES NOT WORK for database
data_db %>% filter_at(c("x", "y"), all_vars(base::is.finite(.)))
```
The last line returns an error:
```
Error in eval_bare(call, env) : object 'x' not found
```
Is there a good overview of the differences between the dplyr implementations for data frames and databases?
Please help me get the code above working for the database case.
Thank you.
Looking through dbplyr's Function translation vignette, it doesn't look like is.finite() is mentioned, and indeed we can verify that dbplyr doesn't know how to translate is.infinite() to a SQL function such as ISFINITE():
```
dbplyr::translate_sql(is.infinite(x))
# <SQL> is.infinite(`x`)
```
In this case, as per the Writing SQL with dbplyr vignette, you can pass raw SQL through the sql() function, something like:
```
# Rough idea, but probably won't work
data_db %>% filter(across(c("x", "y"), sql("NOT ISFINITE(.)")))
```
Though that admittedly looks like it has a low chance of working because of the . inside the SQL string. I'm not sure how well across() (or the older filter_at()) plays with sql(). You may need to write out the columns you want to filter on:
```
data_db %>% filter(sql("NOT ISFINITE(x)"), sql("NOT ISFINITE(y)"))
```
Is there a good overview of the differences required in implementations of dplyr for dataframes vs. databases?
The vignettes mentioned above are both good reading on this.
You could turn it into a proper data.frame or tibble:
```
data_db <- tbl(db, "data") %>% as_tibble()

# Now the same filter works for both the data frame and the collected table:
data %>% filter_at(c("x", "y"), all_vars(base::is.finite(.)))
data_db %>% filter_at(c("x", "y"), all_vars(base::is.finite(.)))
```
Output:
```
> data %>% filter_at(c("x", "y"), all_vars(base::is.finite(.)))
  x y z
1 1 2 1
2 1 2 2
> data_db %>% filter_at(c("x", "y"), all_vars(base::is.finite(.)))
# A tibble: 2 x 3
      x     y     z
  <dbl> <dbl> <int>
1     1     2     1
2     1     2     2
```
You could use as.data.frame instead of as_tibble if that is more to your liking. Note that this pulls the whole table into local memory, so it only makes sense when the data fits in RAM.
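The dbplyr-idiomatic verb for the same thing is collect(), which keeps the pipeline style; a minimal sketch:
```
# collect() materializes the lazy tbl locally; after that,
# regular R predicates like is.finite() work as usual.
tbl(db, "data") %>%
  collect() %>%
  filter_at(c("x", "y"), all_vars(base::is.finite(.)))
```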
I have figured it out (almost).
The problem with SQLite is that it does not have an ISFINITE function, causing Error: no such function: ISFINITE (as I posted in the comment to Gregor's answer).
My solution is to use basic arithmetic to detect Inf (for finite x, x * x == x only when x is 0 or 1, while Inf * Inf == Inf):
```
> data_db %>% filter(sql("x * x != x or x = 0 or x = 1"),
+                    sql("y * y != y or y = 0 or y = 1"))
# Source:   lazy query [?? x 3]
# Database: sqlite 3.33.0 [:memory:]
      x     y     z
  <dbl> <dbl> <int>
1     1     2     1
2     1     2     2
```
The only thing is that I could not figure out how to make this work with filter(across(c("x", "y"), ...)).
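One way to apply the same raw-SQL predicate over several columns is to build one sql() filter per column and splice the list in with !!!; a hedged sketch, reusing the arithmetic trick above:
```
library(dplyr)
library(dbplyr)  # for sql()
library(purrr)

# Build "col * col != col OR col = 0 OR col = 1" for each column,
# then splice the list of raw-SQL predicates into filter().
cols <- c("x", "y")
finite_preds <- map(cols, ~ sql(paste0(.x, " * ", .x, " != ", .x,
                                       " OR ", .x, " = 0 OR ", .x, " = 1")))
data_db %>% filter(!!!finite_preds)
```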

Obtain a Count of all the combinations created in a column when grouping by another column in df with different length combinations in R

Sample data frame:
```
Guest <- c("ann", "ann", "beth", "beth", "bill", "bill", "bob", "bob", "bob",
           "fred", "fred", "ginger", "ginger")
State <- c("TX", "IA", "IA", "MA", "AL", "TX", "TX", "AL", "MA", "MA", "IA", "TX", "AL")
df <- data.frame(Guest, State)
```
Desired output
I have tried about a dozen different ideas but am not getting close. The closest was setting up a crosstab, but I didn't know how to get counts from that. Long/wide got me nowhere, etc. Too new still to think outside the box, I guess.
Try this approach. You can arrange your values and then use group_by() and summarise() to reach a structure similar to the one expected:
```
library(dplyr)
library(tidyr)

# Code
new <- df %>%
  arrange(Guest, State) %>%
  group_by(Guest) %>%
  summarise(Chain = paste0(State, collapse = '-')) %>%
  group_by(Chain, .drop = TRUE) %>%
  summarise(N = n())
```
Output:
```
# A tibble: 4 x 2
  Chain        N
  <chr>    <int>
1 AL-MA-TX     1
2 AL-TX        2
3 IA-MA        2
4 IA-TX        1
```
We can use base R with aggregate and table:
```
table(aggregate(State ~ Guest, df[do.call(order, df), ], paste, collapse = '-')$State)
# output
# AL-MA-TX    AL-TX    IA-MA    IA-TX
#        1        2        2        1
```
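For what it's worth, the last two steps of the dplyr pipeline can be collapsed with count(); a small sketch of the same idea:
```
library(dplyr)

# count(Chain) is shorthand for group_by(Chain) %>% summarise(n = n())
df %>%
  arrange(Guest, State) %>%
  group_by(Guest) %>%
  summarise(Chain = paste(State, collapse = "-")) %>%
  count(Chain, name = "N")
```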

Passing string as an argument in R

On a fairly regular basis I want to pass in strings that function as arguments in code. For context, I often want a section where I can pass in filtering criteria or assumptions that then flow through my analysis, plots, etc. to make it more interactive.
A simple example is below. I've seen the eval/parse solution, but it seems like that makes code chunks unreadable. Is there a better/cleaner/shorter way to do this?
```
column.names <- c("group1", "group2") # two column names I want to toggle between for grouping
select.column <- column.names[1]      # select the column for grouping

DataTable.summary <-
  DataTable %>%
  group_by(select.column) %>% # How do I pass that selection in here?
  summarize(avg.price = mean(SALES.PRICE))
```
Well, this is just a copy-paste from the tidyverse programming vignette (https://dplyr.tidyverse.org/articles/programming.html#programming-recipes):
```
my_summarise <- function(df, group_var) {
  group_var <- enquo(group_var)
  print(group_var)

  df %>%
    group_by(!!group_var) %>%
    summarise(a = mean(a))
}

my_summarise(df, g1)
#> <quosure>
#> expr: ^g1
#> env:  global
#> # A tibble: 2 x 2
#>      g1     a
#>   <dbl> <dbl>
#> 1     1  2.5
#> 2     2  3.33
```
But I think it illustrates your problem. What you really want to do is like the code above, i.e. create a function.
You can use the group_by_ function for the example in your question:
```
library(dplyr)

x <- data.frame(group1 = letters[1:4], group2 = LETTERS[1:4], value = 1:4)
select.columns <- c("group1", "group2")
x %>% group_by_(select.columns[2]) %>% summarize(avg = mean(value))
# A tibble: 4 x 2
#   group2   avg
#   <fct>  <dbl>
# 1 A          1
# 2 B          2
# 3 C          3
# 4 D          4
```
The *_ family of functions in dplyr might also offer the more general solution you are after, although the dplyr documentation says they are deprecated (?group_by_) and might disappear at some point. An analogous expression to the above solution using the tidy evaluation syntax seems to be:
```
x %>% group_by(!!sym(select.columns[2])) %>% summarize(avg = mean(value))
```
And for several columns:
```
x %>% group_by(!!!syms(select.columns)) %>% summarize(avg = mean(value))
```
This creates symbols out of strings that are then evaluated by dplyr.
I recommend using group_by_at(). It supports both single strings and character vectors:
```
nms <- c("cyl", "am")
mtcars %>% group_by_at(nms)
```
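On dplyr 1.0+, group_by_at() is itself superseded; the current idioms for grouping by strings are .data[[ ]] for a single column and across(all_of()) for several. A sketch reusing x and select.columns from above:
```
# Single column given as a string
x %>% group_by(.data[[select.columns[2]]]) %>% summarize(avg = mean(value))

# Several columns given as a character vector
x %>% group_by(across(all_of(select.columns))) %>% summarize(avg = mean(value))
```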

fixing incompatible types error in R using dplyr/mutate

I'm trying to use the tidyverse/dplyr package in R to work with data, including vectorized calls to an online API (from Altmetric), to add a column using mutate.
The smallest code I can create that reproduces the error is below. I get the error "Error: incompatible types, expecting a numeric vector".
```
library(tidyverse)
library(jsonlite)

fromJSON_wrapper <- function(x, y) {
  fromJSON(x)[[c(y)]]
}

toy <- tibble(
  doi = c("10.1002/anie.201500251", "10.1080/19443994.2015.1005695", "10.1007/s13721-015-0095-0"),
  url = c("https://api.altmetric.com/v1/doi/10.1002/anie.201500251",
          "https://api.altmetric.com/v1/doi/10.1080/19443994.2015.1005695",
          "https://api.altmetric.com/v1/doi/10.1080/19443994.2015.1005695")
)

extracted <- toy %>% rowwise() %>% mutate(score = fromJSON_wrapper(url, "score"))
```
The code below for extracting a single score works, whether using the wrapper directly or on a one-row tibble, and I'm not sure why my code above isn't working:
```
fromJSON_wrapper("https://api.altmetric.com/v1/doi/10.1007/s13721-015-0095-0", "score")
extracted <- toy[1, ] %>% rowwise() %>% mutate(score = fromJSON_wrapper(url, "score"))
```
Any suggestions would be appreciated.
It's simpler to just iterate over the vector of URLs and extract what you need. purrr::map_dbl makes this simple, though sapply would work fine, too.
```
library(tidyverse)

toy <- tibble(
  doi = c("10.1002/anie.201500251", "10.1080/19443994.2015.1005695", "10.1007/s13721-015-0095-0"),
  url = c("https://api.altmetric.com/v1/doi/10.1002/anie.201500251",
          "https://api.altmetric.com/v1/doi/10.1080/19443994.2015.1005695",
          "https://api.altmetric.com/v1/doi/10.1080/19443994.2015.1005695")
)

extracted <- toy %>% mutate(score = map_dbl(url, ~jsonlite::fromJSON(.x)$score))
extracted %>% select(doi, score)
#> # A tibble: 3 × 2
#>   doi                           score
#>   <chr>                         <dbl>
#> 1 10.1002/anie.201500251         0.25
#> 2 10.1080/19443994.2015.1005695  1.00
#> 3 10.1007/s13721-015-0095-0      1.00
```
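If the API can fail for some DOIs, a defensive variant wraps the extraction with purrr::possibly() so one bad URL yields NA instead of aborting the whole mutate; a hedged sketch:
```
library(tidyverse)

# possibly() returns a version of the function that yields `otherwise`
# on error instead of throwing.
safe_score <- possibly(~ jsonlite::fromJSON(.x)$score, otherwise = NA_real_)
extracted <- toy %>% mutate(score = map_dbl(url, safe_score))
```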
