R/arrow summarizing on variable columns

I have a large-ish parquet file I'm referencing via arrow::open_dataset. I'd like to get the max value of one or more of the columns, where I don't know a priori which (or how many) columns. In general, this sounds like "programming with dplyr" (assuming arrow-10 and its recent support of dplyr::across), but I can't get it to work.
library(arrow)
library(dplyr)
write_parquet(data.frame(a = c(1, 9), b = c(2, 10), d = c("q", "r")), "quux.parquet")
vars <- c("a")
open_dataset("quux.parquet") %>%
  summarize(across(sym(vars), ~ max(.))) %>%
  collect()
# # A tibble: 1 x 1
#       a
#   <dbl>
# 1     9
But when vars is length 2 or more, I assume I need to be using syms or similar; that fails with
vars <- c("a", "b")
open_dataset("quux.parquet") %>%
  summarize(across(all_of(syms(vars)), ~ max(.))) %>%
  collect()
# Error: Must subset columns with a valid subscript vector.
# x Subscript has the wrong type `list`.
# i It must be numeric or character.
How do I lazily (without loading all of the data) find the max of multiple columns in an arrow dataset?
I suspect the correct answer in dplyr will be some form of syms; whether arrow supports that is the next question. I'm not tied to the dplyr mechanisms: if there's a method using ds$NewScan() or similar, I'm amenable.

Is this the kind of thing you're after - using tidyselect's all_of function?
library(arrow)
library(dplyr)
write_parquet(data.frame(a = c(1, 9), b = c(2, 10), d = c("q", "r")), "quux.parquet")
vars <- c("a", "d")
open_dataset("quux.parquet") %>%
  summarize(across(all_of(vars), ~ max(.))) %>%
  collect()
#> # A tibble: 1 × 2
#>       a d
#>   <dbl> <chr>
#> 1     9 r
See https://tidyselect.r-lib.org/reference/index.html for the different tidyselect functions you may also want to check out.
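Since the column names arrive as a character vector, the same pattern drops straight into a function. A minimal sketch (assuming arrow >= 10 for across() support and the quux.parquet file created above; max_of is a made-up name):
max_of <- function(path, vars) {
  open_dataset(path) %>%
    summarize(across(all_of(vars), ~ max(.))) %>%
    collect()
}
max_of("quux.parquet", c("a", "b"))
Everything before collect() builds a lazy query, so the full file is never materialized in R.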

Related

R double row filter by string

I'm cleaning a dataset that doesn't yet have column names (so I'm working with indexes), and I'm trying to filter two columns of a df by piping the results of the first filter into the second. I don't understand why the below doesn't work:
stripcols <- c("", "Total+")
df <- df %>%
  filter(!df[,1] %in% stripcols) %>%
  filter(!df[,2] %in% stripcols)
Running this results in:
Error in filter_impl(.data, quo) : Result must have length 46, not 58
This is easily worked around by running the filter twice, but I don't understand why this didn't work.
I'm also curious as to whether there is a way to do this with one filter command that is applied on both columns rather than two.
The source of the error is that you are always comparing against nrow(df) rows regardless of how many rows hit the second filter. For instance:
dat <- data.frame(a=1:10)
dat %>% filter(a > 5)
# a
# 1 6
# 2 7
# 3 8
# 4 9
# 5 10
The way you're writing it, you're doing
dat %>% filter(dat[,1] > 5)
# a
# 1 6
# 2 7
# 3 8
# 4 9
# 5 10
For this first call, the number of rows that go into filter is 10, and the number of rows being compared inside filter is also 10. However, if you were to do:
dat %>% filter(dat[,1] > 5) %>% filter(dat[,1] > 7)
# Error in filter_impl(.data, quo) : Result must have length 5, not 10
this fails because the number of rows going into the second filter is only 5, not 10, though we are giving the filter command 10 comparisons by using dat[,1].
(N.B.: many comments about names are perfectly appropriate, but let's continue with the theme of using column indices.)
The first trick is to give each filter only as many comparisons as the data coming in; another way to say this is to do comparisons on the state of the data at that point in the pipe. magrittr (and therefore dplyr) does this with the . placeholder. The dot can always be inferred (it defaults to the first argument of the function after %>%), but some feel that being explicit is better. For instance, this is legal:
mtcars %>%
  group_by(cyl) %>%
  tally()
# # A tibble: 3 x 2
#     cyl     n
#   <dbl> <int>
# 1     4    11
# 2     6     7
# 3     8    14
but an explicit equivalent pipe is this:
mtcars %>%
  group_by(., cyl) %>%
  tally(.)
If the first argument to the function is not the frame itself, then the inferred-dot way fails:
mtcars %>%
  xtabs(~ cyl + vs)
# Error in as.data.frame.default(data, optional = TRUE) :
#   cannot coerce class '"formula"' to a data.frame
(It is effectively calling xtabs(., ~ cyl + vs), and without named arguments xtabs assumes its first argument is the formula.)
so we must be explicit in these situations:
mtcars %>%
  xtabs(~ cyl + vs, data = .)
#    vs
# cyl  0  1
#   4  1 10
#   6  3  4
#   8 14  0
(A contrived example, granted.) One could also do mtcars %>% xtabs(formula = ~ cyl + vs), but my point stands.
So to adapt your code, I would expect this to work:
df %>%
  filter(!.[,1] %in% stripcols) %>%
  filter(!.[,2] %in% stripcols)
I think I'd prefer the [[ approach, partly because I know that tbl_df and data.frame deal with [,1] slightly differently; though [,1] works here, I still prefer the explicitness of [[:
df %>%
  filter(!.[[1]] %in% stripcols) %>%
  filter(!.[[2]] %in% stripcols)
which should work. Of course, combining works just fine, too:
df %>%
  filter(!.[[1]] %in% stripcols, !.[[2]] %in% stripcols)
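On newer dplyr (assuming >= 1.0.4, where if_all() exists), the index-based filter can also be written without the dot at all, since tidyselect accepts column positions:
df %>%
  filter(if_all(1:2, ~ ! . %in% stripcols))
Here 1:2 selects the first two columns by position, and the predicate must hold in every selected column for a row to be kept.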

Sparklyr : force allocation to use functions such as n_distinct, match [duplicate]

I'm trying to count the number of unique elements in each column of the Spark dataset s.
However, it seems that Spark doesn't recognize tally():
k <- collect(s %>% group_by(grouping_type) %>% summarise_each(funs(tally(distinct(.)))))
# Error: org.apache.spark.sql.AnalysisException: undefined function TALLY
It seems that Spark doesn't recognize simple R functions either, like unique or length. I can run the code on local data, but when I try to run the exact same code on a Spark table it doesn't work.
```
d <- data.frame(cbind(seq(1, 10, 1), rep(1, 10)))
d$group <- rep(c("a", "b"), each = 5)
d %>% group_by(group) %>% summarise_each(funs(length(unique(.))))
# # A tibble: 2 × 3
#   group    X1    X2
#   <chr> <int> <int>
# 1 a         5     1
# 2 b         5     1
k <- collect(s %>% group_by(grouping_type) %>% summarise_each(funs(length(unique(.)))))
# Error: org.apache.spark.sql.AnalysisException: undefined function UNIQUE;
```
library(sparklyr)
library(dplyr)
# I am on Spark v2.1
sc <- spark_connect(master = "local")  # connection added for completeness; `sc` was assumed in the original
# Building example input (local)
d <- data.frame(cbind(seq(1, 10, 1), rep(1, 10)))
d$group <- rep(c("a", "b"), each = 5)
d
# Spark tbl
sdf <- sparklyr::sdf_copy_to(sc, d)
# The answer
sdf %>%
  group_by(group) %>%
  summarise_all(funs(n_distinct)) %>%
  collect()
# Output:
#   group    X1    X2
#   <chr> <dbl> <dbl>
# 1 b         5     1
# 2 a         5     1
NB: Given that we are using sparklyr I went for dplyr::n_distinct().
Minor: dplyr::summarise_each is deprecated. Thus, dplyr::summarise_all.
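A further minor note: on dplyr >= 0.8, funs() is deprecated as well, so on such a version (an assumption about your setup) the same aggregation can be spelled without it:
sdf %>%
  group_by(group) %>%
  summarise_all(n_distinct) %>%
  collect()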
Remember that when you are writing sparklyr you are really transpiling to Spark SQL, so you may need to use Spark SQL verbs from time to time. This is one of those times, where Spark SQL functions like count and distinct come in handy.
library(sparklyr)
sc <- spark_connect(master = "local")
iris_spk <- copy_to(sc, iris)
# For instance, this does not work in plain R, but it does in sparklyr:
iris_spk %>%
  summarise(Species = distinct(Species))
# or
iris_spk %>%
  summarise(Species = approx_count_distinct(Species))
# This does what you are looking for:
iris_spk %>%
  group_by(Species) %>%
  summarise_all(funs(n_distinct))
# For larger data sets this is much faster:
iris_spk %>%
  group_by(Species) %>%
  summarise_all(funs(approx_count_distinct))
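If you want to check what a given pipeline transpiles to, show_query() (from dplyr/dbplyr, on which sparklyr builds) prints the generated SQL without executing it; a quick sketch against the tbl above:
iris_spk %>%
  group_by(Species) %>%
  summarise_all(funs(n_distinct)) %>%
  show_query()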

How to parametrize function calls in dplyr 0.7?

The release of dplyr 0.7 includes a major overhaul of programming with dplyr. I read this document carefully, and I am trying to understand how it will impact my use of dplyr.
Here is a common idiom I use when building reporting and aggregation functions with dplyr:
my_report <- function(data, grouping_vars) {
  data %>%
    group_by_(.dots = grouping_vars) %>%
    summarize(x_mean = mean(x), x_median = median(x), ...)
}
Here, grouping_vars is a vector of strings.
I like this idiom because I can pass in string vectors from other places, say a file or a Shiny app's reactive UI, but it's also not too bad for interactive work either.
However, in the new programming with dplyr vignette, I see no examples of how something like this can be done with the new dplyr. I only see examples of how passing strings is no longer the correct approach, and I have to use quosures instead.
I'm happy to adopt quosures, but how exactly do I get from strings to the quosures expected by dplyr here? It doesn't seem feasible to expect the entire R ecosystem to provide quosures to dplyr - lots of times we're going to get strings and they'll have to be converted.
Here is an example showing what you're now supposed to do, and how my old idiom doesn't work:
library(dplyr)
grouping_vars <- quo(am)
mtcars %>%
  group_by(!!grouping_vars) %>%
  summarise(mean_cyl = mean(cyl))
#> # A tibble: 2 × 2
#>      am mean_cyl
#>   <dbl>    <dbl>
#> 1     0 6.947368
#> 2     1 5.076923
grouping_vars <- "am"
mtcars %>%
  group_by(!!grouping_vars) %>%
  summarise(mean_cyl = mean(cyl))
#> # A tibble: 1 × 2
#>   `"am"` mean_cyl
#>   <chr>     <dbl>
#> 1 am       6.1875
dplyr will have a specialized group_by function, group_by_at, to deal with multiple grouping variables. It would be much easier to use this new member of the _at family:
# using the pre-release 0.6.0
cols <- c("am", "gear")
mtcars %>%
  group_by_at(.vars = cols) %>%
  summarise(mean_cyl = mean(cyl))
# Source: local data frame [4 x 3]
# Groups: am [?]
#
#      am  gear mean_cyl
#   <dbl> <dbl>    <dbl>
# 1     0     3 7.466667
# 2     0     4 5.000000
# 3     1     4 4.500000
# 4     1     5 6.000000
The .vars argument accepts either a character/numeric vector or column names generated by vars():
.vars
A list of columns generated by vars(), or a character vector of
column names, or a numeric vector of column positions.
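For instance, these two calls should be equivalent (a sketch, assuming dplyr >= 0.7, where group_by_at and vars exist):
mtcars %>% group_by_at(vars(am, gear)) %>% summarise(mean_cyl = mean(cyl))
mtcars %>% group_by_at(c("am", "gear")) %>% summarise(mean_cyl = mean(cyl))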
Here's the quick and dirty reference I wrote for myself.
# install.packages("rlang")
library(tidyverse)
dat <- data.frame(cat = sample(LETTERS[1:2], 50, replace = TRUE),
                  cat2 = sample(LETTERS[3:4], 50, replace = TRUE),
                  value = rnorm(50))
Representing column names with strings
Convert strings to symbol objects using rlang::sym and rlang::syms.
summ_var <- "value"
group_vars <- c("cat", "cat2")
summ_sym <- rlang::sym(summ_var)      # capture a single symbol
group_syms <- rlang::syms(group_vars) # create a list of symbols
dat %>%
  group_by(!!!group_syms) %>%         # splice the list of symbols into the call
  summarize(summ = sum(!!summ_sym))   # splice the single symbol into the call
If you use !! or !!! outside of dplyr functions you will get an error.
The usage of rlang::sym and rlang::syms is identical inside functions.
summarize_by <- function(df, summ_var, group_vars) {
  summ_sym <- rlang::sym(summ_var)
  group_syms <- rlang::syms(group_vars)
  df %>%
    group_by(!!!group_syms) %>%
    summarize(summ = sum(!!summ_sym))
}
We can then call summarize_by with string arguments.
summarize_by(dat, "value", c("cat", "cat2"))
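For what it's worth, on current dplyr (>= 1.0, later than this answer) the same function can be written without explicit rlang, using across(all_of()) for the groups and the .data pronoun for the summarized column; a sketch:
summarize_by <- function(df, summ_var, group_vars) {
  df %>%
    group_by(across(all_of(group_vars))) %>%
    summarize(summ = sum(.data[[summ_var]]))
}
summarize_by(dat, "value", c("cat", "cat2"))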
Using non-standard evaluation for column/variable names
summ_quo <- quo(value)        # capture a single variable for NSE
group_quos <- quos(cat, cat2) # capture a list of variables for NSE
dat %>%
  group_by(!!!group_quos) %>%       # use !!! with both quos and rlang::syms
  summarize(summ = sum(!!summ_quo)) # use !! with both quo and rlang::sym
Inside functions, use enquo rather than quo. quos, on the other hand, works the same inside and outside functions.
summarize_by <- function(df, summ_var, ...) {
  summ_quo <- enquo(summ_var) # enquo can only capture a single argument
  group_quos <- quos(...)     # quos captures multiple arguments, even inside functions
  df %>%
    group_by(!!!group_quos) %>%
    summarize(summ = sum(!!summ_quo))
}
And then our function call is
summarize_by(dat, value, cat, cat2)
If you want to group by possibly more than one column, you can use quos
grouping_vars <- quos(am, gear)
mtcars %>%
  group_by(!!!grouping_vars) %>%
  summarise(mean_cyl = mean(cyl))
#      am  gear mean_cyl
#   <dbl> <dbl>    <dbl>
# 1     0     3 7.466667
# 2     0     4 5.000000
# 3     1     4 4.500000
# 4     1     5 6.000000
Right now, it doesn't seem like there's a great way to turn strings into quos. Here's one way that does work though
cols <- c("am", "gear")
grouping_vars <- rlang::parse_quosures(paste(cols, collapse = ";"))
mtcars %>%
  group_by(!!!grouping_vars) %>%
  summarise(mean_cyl = mean(cyl))
#      am  gear mean_cyl
#   <dbl> <dbl>    <dbl>
# 1     0     3 7.466667
# 2     0     4 5.000000
# 3     1     4 4.500000
# 4     1     5 6.000000
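Alternatively, and consistent with the rlang reference above, the strings can go through rlang::syms and !!!, which avoids round-tripping through parsed text:
cols <- c("am", "gear")
mtcars %>%
  group_by(!!!rlang::syms(cols)) %>%
  summarise(mean_cyl = mean(cyl))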

how to drop columns by passing variable name with dplyr?

I have a df as follows:
a <- data_frame(keep = c("hello", "world"), drop = c("nice", "work"))
a
# Source: local data frame [2 x 2]
#    keep  drop
#   (chr) (chr)
# 1 hello  nice
# 2 world  work
I can use a %>% select(-drop) to drop the column without problem. However, if I want to pass a variable holding the column name instead, it returns an error.
name <- "drop"
a %>% select(-(name))
# Error in -(name) : invalid argument to unary operator
You can use one_of to find the column positions and then use - to drop them: select(-one_of(name)). If you check ?select, this usage is documented in the Drop variables section of the Examples:
name <- "drop"
a %>% select(-one_of(name))
# # A tibble: 2 × 1
#    keep
#   <chr>
# 1 hello
# 2 world
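(In later dplyr/tidyselect releases, one_of() is superseded by all_of() and any_of(); assuming such a version, the equivalent spelling is below, where any_of() also tolerates names absent from the data.)
a %>% select(-any_of(name))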
Or with select_: paste - onto the column names to drop, and pass the pasted names to the .dots parameter if there is more than one column to be dropped:
name <- "drop"
a %>% select_(.dots = paste("-", name))
# # A tibble: 2 × 1
#    keep
#   <chr>
# 1 hello
# 2 world
You can simply use
a <- data_frame(keep = c("hello", "world"), drop = c("nice", "work"))
select(a, -starts_with('drop'))
# Source: local data frame [2 x 1]
#
#    keep
#   (chr)
# 1 hello
# 2 world
It's worth searching for previously written solutions too; please see the dplyr documentation page Select/rename variables by name.
I hope that does the job for you :)
We can use select_ with setdiff:
a %>%
  select_(.dots = setdiff(names(.), name))
# # A tibble: 2 × 1
#    keep
#   <chr>
# 1 hello
# 2 world
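On current dplyr (assuming >= 1.0, where the underscore verbs are retired), the same set-difference idea reads:
a %>% select(all_of(setdiff(names(a), name)))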
A few more possibilities:
name <- "drop"
a %>% `[<-`(name, value = NULL)
a %>% magrittr::inset(name, value = NULL)
a %>% purrr::modify_at(name, ~ NULL)
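All three are pipe-friendly spellings of the base R idiom of assigning NULL to a column, which outside a pipe is simply:
name <- "drop"
a[name] <- NULL  # base R: removes the column named by `name`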
I could only get these solutions to work by first ungrouping the data using ungroup:
df <- df %>% ungroup() %>% select(-hello)
Notice there are no quotation marks around the column name you want to drop (hello). To remove multiple columns, just place a comma after hello and add the next column.
From the ?select_ help: "dplyr used to offer twin versions of each verb suffixed with an underscore. ... However, dplyr now uses tidy evaluation semantics. ... Thus, the underscored versions are now superfluous."
The example given in vignette("programming"), similar to #Psidom's answer, is:
name <- "drop"
a %>% select(!all_of(name))
Alternatively, one could create a function to drop columns, so that drop does not need quoting:
drop_columns <- function(data, cols) {
  data %>% select(!{{ cols }})
}
drop_columns(a, drop)
