dplyr works differently for dataframes and databases - how to resolve?

Apparently, dplyr requires different implementations for data frames and databases; this is not the first time I have run into this.
The example code is below. Its purpose is to remove rows containing Inf values from the database table.
library(DBI)
library(RSQLite)
library(dplyr)  # copy_to(), tbl() and filter_at() come from dplyr (with dbplyr as the database backend)
# dataframe
data <- data.frame(x = c(rep(1, 2), rep(Inf, 3), rep(1, 5)),
                   y = c(rep(2, 5), rep(Inf, 5)),
                   z = 1:10)
# database
db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(db, data, name = "data", overwrite = TRUE)
data_db <- tbl(db, "data")
# WORKS for dataframe:
data %>% filter_at(c("x", "y"), all_vars(base::is.finite(.)))
# DOES NOT WORK for database
data_db %>% filter_at(c("x", "y"), all_vars(base::is.finite(.)))
The last line returns an error:
Error in eval_bare(call, env) : object 'x' not found
Is there a good overview of the differences required in implementations of dplyr for dataframes vs. databases?
Please help me get the code above working for the database case.
Thank you

Looking through dbplyr's Function translation vignette, is.finite() doesn't appear to be mentioned, and indeed we can verify that dbplyr doesn't know how to translate is.finite()/is.infinite() into SQL; the call is passed through untouched:
dbplyr::translate_sql(is.infinite(x))
# <SQL> is.infinite(`x`)
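For comparison, a function dbplyr does know about gets rewritten into real SQL, while unknown functions are generally passed along verbatim and only fail once the database tries to run them. A quick check (a sketch, output shown approximately; newer dbplyr versions want an explicit con, e.g. dbplyr::simulate_sqlite()):
```
dbplyr::translate_sql(is.na(x), con = dbplyr::simulate_sqlite())
# <SQL> (`x` IS NULL)     -- known translation
dbplyr::translate_sql(is.finite(x), con = dbplyr::simulate_sqlite())
# <SQL> is.finite(`x`)    -- passed through as-is; SQLite has no such function
```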
In this case, as per the Writing SQL with dbplyr vignette, you can use the sql() escape hatch to pass raw SQL through untranslated, something like:
## rough idea, but probably won't work
data_db %>% filter(across(c("x", "y"), sql("NOT ISFINITE(.)")))
Though admittedly that has a low chance of working because of the . inside the SQL string; I'm not sure how well across() (or the older filter_at()) plays with sql(). You may need to write out the columns you want to filter on:
data_db %>% filter(sql("NOT ISFINITE(x)"), sql("NOT ISFINITE(y)"))
Is there a good overview of the differences required in implementations of dplyr for dataframes vs. databases?
The vignettes mentioned above are both good reading on this.

You could turn it into a proper data.frame or tibble:
data_db <- tbl(db, "data") %>% as_tibble()
# WORKS for dataframe:
data %>% filter_at(c("x", "y"), all_vars(base::is.finite(.)))
# NOW ALSO WORKS, because data_db has been pulled into a regular tibble
data_db %>% filter_at(c("x", "y"), all_vars(base::is.finite(.)))
Output:
> data %>% filter_at(c("x", "y"), all_vars(base::is.finite(.)))
x y z
1 1 2 1
2 1 2 2
> data_db %>% filter_at(c("x", "y"), all_vars(base::is.finite(.)))
# A tibble: 2 x 3
x y z
<dbl> <dbl> <int>
1 1 2 1
2 1 2 2
You could use as.data.frame instead of as_tibble if that is more to your liking.
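Note that as_tibble() on a lazy tbl pulls the entire table into R before you filter, so for a large table you lose the benefit of doing the work in the database. The more idiomatic spelling of the same thing is collect(); a minimal sketch using the objects defined in the question:
```
# pulls every row of "data" into an in-memory tibble, then filters locally
data_local <- tbl(db, "data") %>% collect()
data_local %>% filter_at(c("x", "y"), all_vars(base::is.finite(.)))
```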

I have figured it out (almost).
The problem with SQLite is that it does not have an ISFINITE function, which causes Error: no such function: ISFINITE (as I noted in a comment on Gregor's answer).
My solution is to use basic arithmetic to detect Inf: for any finite x other than 0 and 1, x * x != x, whereas Inf * Inf is still Inf:
> data_db %>% filter(sql("x * x != x or x = 0 or x = 1"),
+ sql("y * y != y or y = 0 or y = 1"))
# Source: lazy query [?? x 3]
# Database: sqlite 3.33.0 [:memory:]
x y z
<dbl> <dbl> <int>
1 1 2 1
2 1 2 2
The only thing I could not figure out is how to make this work with filter(across(c("x", "y"), ...)) rather than spelling out each column; one possible approach is sketched below.
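A sketch, not verified against this exact setup: the same finiteness test written as a purrr-style lambda inside if_all(), which dbplyr can translate for every listed column (requires reasonably recent versions, roughly dplyr >= 1.0.4 and dbplyr >= 2.1):
```
data_db %>%
  filter(if_all(c(x, y), ~ .x * .x != .x | .x == 0 | .x == 1))
```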

Related

Why does add_column assign a letter to the data?

I tried reading through R's documentation on the add_column function, but I'm a little confused as to the examples it provides. See below:
# add_column ---------------------------------
df <- tibble(x = 1:3, y = 3:1)
df %>% add_column(z = -1:1, w = 0)
df %>% add_column(z = -1:1, .before = "y")
# You can't overwrite existing columns
try(df %>% add_column(x = 4:6))
# You can't create new observations
try(df %>% add_column(z = 1:5))
What is the purpose of these letters that are being assigned a range? E.g.:
z = 1:5
My understanding from the documentation is that add_column() takes a data frame and appends new columns at a position controlled by the .before and .after arguments, defaulting to the end of the data frame.
I'm a little confused here. There is also a "..." argument that takes name-value pairs. Is that what I'm seeing with "z = 1:5"? What is the functional purpose of this?
data.frame columns always have a name in R, no exception.
Since add_column adds new columns, you need to specify names for these columns.
… well, technically you don’t need to. The following works:
df %>% add_column(1 : 3)
But add_column auto-generates the column name based on the expression you pass it, and you might not like the result (in this case, it’s literally 1:3, which isn’t a convenient name to work with).
Conversely, the following also works and is perfectly sensible:
z = 1 : 3
df %>% add_column(z)
Result:
# A tibble: 3 x 3
x y z
<int> <int> <int>
1 1 3 1
2 2 2 2
3 3 1 3
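So if you want both the convenience of passing an existing vector and a name of your own choosing, just name it in the call, as in the documentation examples. A minimal sketch continuing the tibble above:
```
w <- 4:6
df %>% add_column(new_col = w)   # the column is named "new_col", not "w"
```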

count number of unique elements in each column with dplyr in sparklyr

I'm trying to count the number of unique elements in each column of the Spark dataset s.
However, it seems that Spark doesn't recognize tally():
k<-collect(s%>%group_by(grouping_type)%>%summarise_each(funs(tally(distinct(.)))))
Error: org.apache.spark.sql.AnalysisException: undefined function TALLY
It seems that Spark doesn't recognize simple R functions either, like "unique" or "length". I can run the code on local data, but when I try to run the exact same code on the Spark table it doesn't work:
```
d<-data.frame(cbind(seq(1,10,1),rep(1,10)))
d$group<-rep(c("a","b"),each=5)
d%>%group_by(group)%>%summarise_each(funs(length(unique(.))))
A tibble: 2 × 3
group X1 X2
<chr> <int> <int>
1 a 5 1
2 b 5 1
k<-collect(s%>%group_by(grouping_type)%>%summarise_each(funs(length(unique(.)))))
Error: org.apache.spark.sql.AnalysisException: undefined function UNIQUE;
```
library(sparklyr)
library(dplyr)
#I am on Spark V. 2.1
#Building example input (local)
d <- data.frame(cbind(seq(1, 10, 1), rep(1,10)))
d$group <- rep(c("a","b"), each = 5)
d
#Spark tbl
sdf <- sparklyr::sdf_copy_to(sc, d)
# The Answer
sdf %>%
  group_by(group) %>%
  summarise_all(funs(n_distinct)) %>%
  collect()
#Output
group X1 X2
<chr> <dbl> <dbl>
1 b 5 1
2 a 5 1
NB: given that we are using sparklyr, I went for dplyr::n_distinct().
Minor: dplyr::summarise_each() is deprecated, hence dplyr::summarise_all().
Remember that when you write sparklyr code you are really transpiling to Spark SQL, so you may need to reach for Spark SQL functions from time to time. This is one of those times, where Spark SQL functions like count and distinct come in handy.
library(sparklyr)
sc <- spark_connect()
iris_spk <- copy_to(sc, iris)
# for instance this does not work in R, but it does in sparklyr
iris_spk %>%
  summarise(Species = distinct(Species))
# or
iris_spk %>%
  summarise(Species = approx_count_distinct(Species))
# this does what you are looking for
iris_spk %>%
  group_by(Species) %>%
  summarise_all(funs(n_distinct))
# for larger data sets this is much faster
iris_spk %>%
  group_by(Species) %>%
  summarise_all(funs(approx_count_distinct))
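On current dplyr versions funs() is deprecated as well; a hedged equivalent of the n_distinct summary above uses across(), which recent dbplyr/sparklyr versions (dplyr >= 1.0) can translate:
```
iris_spk %>%
  group_by(Species) %>%
  summarise(across(everything(), n_distinct)) %>%
  collect()
```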

fixing incompatible types error in R using dplyr/mutate

I'm trying to use the tidyverse/dplyr packages in R to work with data, including vectorized calls to an online API (from Altmetric), to add a column using mutate.
The smallest code I can create that reproduces the error is below. I get the error "Error: incompatible types, expecting a numeric vector".
library(tidyverse)
library(jsonlite)
fromJSON_wrapper <- function(x,y) {
fromJSON(x)[[c(y)]]
}
toy <- tibble(
doi = c("10.1002/anie.201500251", "10.1080/19443994.2015.1005695", "10.1007/s13721-015-0095-0"),
url = c("https://api.altmetric.com/v1/doi/10.1002/anie.201500251", "https://api.altmetric.com/v1/doi/10.1080/19443994.2015.1005695", "https://api.altmetric.com/v1/doi/10.1080/19443994.2015.1005695")
)
extracted <- toy %>% rowwise() %>% mutate(score = fromJSON_wrapper(url,"score"))
The code below for extracting a single score works, whether calling the wrapper directly or using mutate on a one-row tibble, and I'm not sure why my code above isn't working.
fromJSON_wrapper("https://api.altmetric.com/v1/doi/10.1007/s13721-015-0095-0", "score")
extracted <- toy[1,] %>% rowwise() %>% mutate(score = fromJSON_wrapper(url, "score"))
Any suggestions would be appreciated.
It's simpler to just iterate over the vector of URLs and extract what you need. purrr::map_dbl makes this simple, though sapply would work fine, too.
library(tidyverse)
toy <- tibble(
doi = c("10.1002/anie.201500251", "10.1080/19443994.2015.1005695", "10.1007/s13721-015-0095-0"),
url = c("https://api.altmetric.com/v1/doi/10.1002/anie.201500251", "https://api.altmetric.com/v1/doi/10.1080/19443994.2015.1005695", "https://api.altmetric.com/v1/doi/10.1080/19443994.2015.1005695")
)
extracted <- toy %>% mutate(score = map_dbl(url, ~jsonlite::fromJSON(.x)$score))
extracted %>% select(doi, score)
#> # A tibble: 3 × 2
#> doi score
#> <chr> <dbl>
#> 1 10.1002/anie.201500251 0.25
#> 2 10.1080/19443994.2015.1005695 1.00
#> 3 10.1007/s13721-015-0095-0 1.00
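For completeness, the sapply route mentioned above looks roughly like this (a sketch of the same idea in base R, assuming the toy tibble defined earlier):
```
toy$score <- sapply(toy$url, function(u) jsonlite::fromJSON(u)$score)
toy
```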

Changing column data type to factor with sparklyr

I am pretty new to Spark and am currently using it through the R API via the sparklyr package. I created a Spark data frame from a Hive query. The data types are not specified correctly in the source table, and I'm trying to reset them using functions from the dplyr package. Below is the code I tried:
prod_dev <- sdf_load_table(...)
num_var <- c("var1", "var2", ...)
cat_var <- c("var_a", "var_b", ...)
pos1 <- which(colnames(prod_dev) %in% num_var)
pos2 <- which(colnames(prod_dev) %in% cat_var)
prod_model_tbl <- prod_dev %>%
  mutate(age = 2016 - as.numeric(substr(dob_yyyymmdd, 1, 4))) %>%
  mutate(msa_fg = ifelse(is.na(msacode2000), 0, 1)) %>%
  mutate(csa_fg = ifelse(is.na(csacode), 0, 1)) %>%
  mutate_each(funs(factor), pos2) %>%
  mutate_each(funs(as.numeric), pos1)
The code works if prod_dev is an R data frame, but using it on a Spark data frame does not seem to produce the correct result:
> head(prod_model_tbl)
Source: query [?? x 99]
Database: spark connection master=yarn-client app=sparklyr_test local=FALSE
Error: org.apache.spark.sql.AnalysisException: undefined function FACTOR; line 97 pos 2248 at org.apache.spark.sql.hive.HiveFunctionRegistry....
Can someone please advise how to make the desired changes to the Spark Data Frame?
In general you can use standard R generic functions for type casting. For example:
df <- data.frame(x=c(1, NA), y=c("-1", "2"))
copy_to(sc, df, "df", overwrite=TRUE) %>%
mutate(x_char = as.character(x)) %>%
mutate(y_numeric = as.numeric(y))
Source: query [2 x 4]
Database: spark connection master=...
x y x_char y_numeric
<dbl> <chr> <chr> <dbl>
1 1 -1 1.0 -1
2 NaN 2 <NA> 2
The problem is that Spark doesn't provide any direct equivalent of R's factor.
In Spark SQL, categorical variables are represented with a double type plus column metadata, and the encoding is done by ML Transformers, which are not part of SQL itself. Therefore there is no place for factor / as.factor. SparkR provides some automatic conversions when working with ML, but I am not sure if there is a similar mechanism in sparklyr (the closest thing I am aware of is ml_create_dummy_variables).
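If the practical goal is simply to turn a string column into numeric category labels before modelling, one option is sparklyr's ft_string_indexer() feature transformer. A sketch with illustrative column names ("var_a" is assumed to be one of the categorical columns from the question):
```
library(sparklyr)
library(dplyr)

# index the string column "var_a" into numeric labels stored in "var_a_idx"
prod_model_tbl <- prod_dev %>%
  ft_string_indexer(input_col = "var_a", output_col = "var_a_idx")
```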
