Pass multiple column names in function to dplyr::distinct() with Spark - r

I want to specify an unknown number of column names in a function that will use dplyr::distinct(). My current attempt is:
myFunction <- function(table, id) {
  table %>%
    dplyr::distinct(.data[[id]])
}
I'm trying .data[[id]] above because the data-masking section of this dplyr blog post states:
When you have an env-variable that is a character vector, you need to index into the .data pronoun with [[, like summarise(df, mean = mean(.data[[var]])).
and the documentation for dplyr::distinct() says about its second argument:
<data-masking> Optional variables to use when determining uniqueness. If there are multiple rows for a given combination of inputs, only the first row will be preserved. If omitted, will use all variables.
Spark
More specifically, I'm trying to use this function with Spark.
sc <- sparklyr::spark_connect(master = "local")
mtcars_tbl <- sparklyr::copy_to(sc, mtcars, "mtcars_spark")
##### desired return
mtcars_tbl %>% dplyr::distinct(cyl, gear)
# Source: spark<?> [?? x 2]
    cyl  gear
  <dbl> <dbl>
1     6     4
2     4     4
3     6     3
4     8     3
5     4     3
6     4     5
7     8     5
8     6     5
##### myFunction fails
id = c("cyl", "gear")
myFunction(mtcars_tbl, id)
Error: Can't convert a call to a string
Run `rlang::last_error()` to see where the error occurred.
Following this comment, I have other failed attempts:
myFunction <- function(table, id) {
  table %>%
    dplyr::distinct(.dots = id)
}
myFunction(mtcars_tbl, id)
# Source: spark<?> [?? x 1]
  .dots
  <list>
1 <named list [2]>
#####
myFunction <- function(table, id) {
  table %>%
    dplyr::distinct_(id)
}
myFunction(mtcars_tbl, id)
Error in UseMethod("distinct_") :
no applicable method for 'distinct_' applied to an object of class "c('tbl_spark', 'tbl_sql', 'tbl_lazy', 'tbl')"

Distinct applies to all columns of a table at once. Consider an example table:
A B
1 4
1 4
2 3
2 3
3 3
3 5
It is not clear what applying distinct to only column A, but not column B should return. The following example is clearly not a good choice because it breaks the relationship between columns A and B. For example, there is no (A = 2, B = 4) row in the original dataset.
A B
1 4
2 4
3 3
Hence the best approach is to select only those columns you want first, and then take the distinct. Something more like:
myFunction <- function(table, id) {
  table %>%
    dplyr::select(dplyr::all_of(id)) %>%
    dplyr::distinct()
}
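A quick usage sketch (assuming the mtcars_tbl object created above); this should return the same eight distinct cyl/gear combinations as the dplyr::distinct(cyl, gear) example, possibly in a different row order:
myFunction(mtcars_tbl, c("cyl", "gear"))
# Source: spark<?> [?? x 2]
# ... eight distinct cyl/gear rows ...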

Related

How to define a variable to record the number of processed rows when using R, dplyr and rowwise?

I have a function that takes a long time to run, so I want to know how many rows of my data frame have been processed. In a for loop we could easily define a counter variable for this, but I do not know how to do it with dplyr.
Let's say the code is:
library(tidyverse)
myFUN <- function(x) {
  x + 1
}
a <- tibble(id = c(1:3), x = c(3, 5, 1))
a1 <- a %>%
  rowwise() %>%
  mutate(y = myFUN(x))
I would like to define a variable i somewhere in the code whose value increases by 1 every time a row is processed, and have its value printed to the console like:
1
2
3
You can pass another variable to the function, the row number of the data frame, and print it inside the function. Something like:
myFUN <- function(x, y) {
  message(y)
  x + 1
}
and then use
library(dplyr)
a %>% mutate(y = purrr::map2_dbl(x, row_number(), myFUN))
# 1
# 2
# 3
# A tibble: 3 x 3
#      id     x     y
#   <int> <dbl> <dbl>
# 1     1     3     4
# 2     2     5     6
# 3     3     1     2
If your function is vectorized, you can drop map_dbl and do
a %>% mutate(y = myFUN(x, seq_len(n())))

unquote string as variable in pipe

I want to remove duplicate rows from a dataframe, for specific columns only. That can be obtained with distinct:
data <- tibble(a = c(1, 1, 2, 2), b = c(3, 3, 3, 4), z = c(5,4,5,5))
filtered_data <- data %>% distinct(a, b, .keep_all = T)
dim(filtered_data)
# [1] 3 3
This is (almost) what I need. Yet, my problem is that the column names I need to use with distinct will change. So I have a character vector gen that contains the names of the columns I want to use with the distinct function. They need to get unquoted to be useful in the pipe. I found suggestions to use as.name() or eval(parse()). This however gives me a different result:
gen <- c("a", "b")
filtered_data <- data %>% distinct(eval(parse(text = gen)), .keep_all = T)
dim(filtered_data)
# [1] 2 4
The eval seems to do something funny with the number of times the data is filtered (and adds an extra column; I could live with that, though). So, how do I obtain the same result as if I had typed a, b, but using a variable instead?
additional information
I actually obtain gen by reading the column names of a data frame: gen <- colnames(data)[1:2]. The solution suggested by @gymbrane would be perfect if I had a way to transform gen into c(a, b). The whole point is to avoid hardcoding the column names. I tried things like gen <- noquote(gen), which does not give an error in the rm_dup_rows function suggested below, but it does give a different result, with the same sort of repeated filtering as I started with...
fixed
I think I got it working. It might be inelegant, and I'm not sure if every step is necessary for the result, but it seems to work by combining the function provided by @gymbrane below with ensym and quos in a for loop while adding to a list in the global environment (edit: the global environment isn't necessary):
unquote_string <- function(string) {
  out <- list()
  i <- 1
  for (s in string) {
    t <- ensym(s)
    out[i] <- dplyr::quos(!!t)
    i <- i + 1
  }
  return(out)
}
gen_quo <- unquote_string(gen)
filtered_data <- rm_dup_rows(data, gen_quo)
dim(filtered_data)
# [1] 3 3
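For what it's worth, a shorter route (a sketch, assuming the rlang package is available) is to convert the character vector of column names to symbols with rlang::syms() and splice them into distinct(); this should give the same 3 x 3 result:
gen <- c("a", "b")
filtered_data <- data %>% distinct(!!!rlang::syms(gen), .keep_all = TRUE)
dim(filtered_data)
# [1] 3 3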
How about creating a function and using quosures? Perhaps something like this is what you are looking for...
rm_dup_rows <- function(data, ...) {
  vars = dplyr::quos(...)
  data %>% distinct(!!!vars, .keep_all = T)
}
I believe this returns what you are asking for
rm_dup_rows(data = data, a, b)
# A tibble: 3 x 3
      a     b     z
  <dbl> <dbl> <dbl>
1     1     3     5
2     2     3     5
3     2     4     5
rm_dup_rows(data, b, z)
# A tibble: 3 x 3
      a     b     z
  <dbl> <dbl> <dbl>
1     1     3     5
2     1     3     4
3     2     4     5
Additional
You could modify rm_dup_rows just slightly and construct your vector with quos. Something like this...
rm_dup_rows <- function(data, vars) {
  data %>% distinct(!!!vars, .keep_all = T)
}
# quos your column name vector
gen <- quos(a, z)
rm_dup_rows(data, gen)
# A tibble: 3 x 3
      a     b     z
  <dbl> <dbl> <dbl>
1     1     3     5
2     1     3     4
3     2     3     5

R - Creating DFs (tibbles) in a loop. How to rename them and columns inside, to include date? (I do it with eval(..), but is there a better solution?)

I have a loop that creates a tibble, tbl, at the end of each iteration. The loop uses a different date each time, date.
Assume:
tbl <- tibble(colA = 1:5, colB = 5:9)
date <- as.Date("2017-02-28")
> tbl
# A tibble: 5 x 2
   colA  colB
  <int> <int>
1     1     5
2     2     6
3     3     7
4     4     8
5     5     9
(The contents change every loop, but the names tbl and date and the column names (colA, colB) stay the same.)
The output I want needs to be named starting with output: outputdate1, outputdate2, and so on.
Its columns should be named colAdate1, colBdate1, then colAdate2, colBdate2, and so on.
At the moment I am using this piece of code, which works, but is not easy to read:
eval(parse(text = (
paste0("output", year(date), months(date), " <- tbl %>% rename(colA", year(date), months(date), " = 'colA', colB", year(date), months(date), " = 'colB')")
)))
It produces this code for eval(parse(...) to evaluate:
"output2017February <- tbl %>% rename(colA2017February = 'colA', colB2017February = 'colB')"
Which gives me the output that I want:
> output2017February
# A tibble: 5 x 2
  colA2017February colB2017February
             <int>            <int>
1                1                5
2                2                6
3                3                7
4                4                8
5                5                9
Is there a better way of doing this? (Preferably with dplyr)
Thanks!
This avoids eval and is easier to read:
ym <- "2017February"
assign(paste0("output", ym), setNames(tbl, paste0(names(tbl), ym)))
Partial rename
If you only wanted to replace the names in the character vector old with the corresponding names in the character vector new then use the following:
assign(paste0("output", ym),
setNames(tbl, replace(names(tbl), match(old, names(tbl)), new)))
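For example (a sketch with hypothetical old/new vectors, renaming only colA and leaving colB untouched):
old <- c("colA")
new <- c("colA2017February")
assign(paste0("output", ym),
       setNames(tbl, replace(names(tbl), match(old, names(tbl)), new)))
# output2017February now has columns colA2017February and colB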
Variation
You might consider putting your data frames in a list instead of having a bunch of loose objects in your workspace:
L <- list()
L[[paste0("output", ym)]] <- setNames(tbl, paste0(names(tbl), ym))
.GlobalEnv could also be used in place of L (omitting the L <- list() line) if you want this style but still want to put the objects separately in the global environment.
dplyr
Here it is using dplyr and rlang but it does involve increased complexity:
library(dplyr)
library(rlang)
.GlobalEnv[[paste0("output", ym)]] <- tbl %>%
  rename(!!!setNames(names(tbl), paste0(names(tbl), ym)))

How do I modify part of a column in a Spark data frame

I am trying to modify part of a column from a Spark data frame. The row selection is based on the vector (in R env) ID.X. The replacement is another vector (in R env) Role.
I have tried the following:
> sdf.bigset %>% filter(`_id` %in% ID.X) %>%
    mutate(data_role = Role)
It crashes my R session.
And the following
> head(DT.XSamples)
_id Role
1: 5996e9e12a2aa6315127ed0e Training
2: 5996e9e12a2aa6315127ed0f Training
3: 5996e9e12a2aa6315127ed10 Training
> setkey(DT.XSamples, `_id`)
> Lookup.XyDaRo <- function(x){
    unlist(DT.XSamples[x, Role])
  }
> sdf.bigset %>% filter(`_id` %in% ID.X) %>% rowwise() %>%
    mutate(data_role = Lookup.XyDaRo(`_id`))
As well as the following
> Fn.lookup.XyDaRo <- function(id, role){
    ifelse(is.na(role), unlist(DT.XSamples[id, Role]), role)
  }
> sdf.bigset %>% rowwise() %>%
    mutate(data_role = Fn.lookup.XyDaRo(`_id`, data_role))
For both of these I then get
Error: is.data.frame(data) is not TRUE
sdf.bigset is a Spark data frame.
DT.XSamples is a data table living in R.
Any idea what I am doing wrong, or how it should be properly done?
Let's say sdf.bigset looks like this:
sdf.bigset <- copy_to(sc, data.frame(`id` = 1:10, data_role = "Unknown"))
and DT.XSamples is defined as:
XSamples <- data.frame(
  `id` = c(3, 5, 9), role = c("Training", "Dev", "Secret")
)
Convert DT.XSamples to Spark:
sdf.XSamples <- copy_to(sc, XSamples)
left_join and coalesce:
left_join(sdf.bigset, sdf.XSamples, by = "id") %>%
  mutate(data_role = coalesce(role, data_role))
# Source: lazy query [?? x 3]
# Database: spark_connection
      id data_role role
   <int> <chr>     <chr>
 1     1 Unknown   NA
 2     2 Unknown   NA
 3     3 Training  Training
 4     4 Unknown   NA
 5     5 Dev       Dev
 6     6 Unknown   NA
 7     7 Unknown   NA
 8     8 Unknown   NA
 9     9 Secret    Secret
10    10 Unknown   NA
Finally drop role with negative select.
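Putting the pieces together, a sketch of the full chain (the trailing select(-role) is the negative select that removes the helper column):
left_join(sdf.bigset, sdf.XSamples, by = "id") %>%
  mutate(data_role = coalesce(role, data_role)) %>%
  select(-role)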
Regarding your code:
Vector replacements won't work because a Spark DataFrame is more a relation (in the relational-algebra sense) than a data frame: in general row order is not defined, so operations like this are not implemented.
The data.table variant won't work because you cannot execute plain R code against a Spark DataFrame, with the exception of the (incredibly inefficient) spark_apply.

r dplyr sample_frac using seed in data

I have a grouped data frame, in which the grouping variable is SEED. I want to take the groups defined by the values of SEED, set the seed to the value of SEED for each group, and then shuffle the rows of each group using dplyr::sample_frac. However, I cannot replicate my results, which indicates that the seed isn't being set correctly.
To do this in a dplyr-ish way, I wrote the following function:
library(dplyr)
ss_sampleseed <- function(df, seed.){
  set.seed(df$seed.)
  sample_frac(df, 1)
}
I then use this function on my data:
dg <- structure(list(Gene = c("CAMK1", "ARPC4", "CIDEC", "CAMK1", "ARPC4",
  "CIDEC"), GENESEED = c(1, 1, 1, 2, 2, 2)), class = c("tbl_df",
  "tbl", "data.frame"), row.names = c(NA, -6L), .Names = c("Gene",
  "GENESEED"))
dg2 <- dg %>%
  group_by(GENESEED) %>%
  ss_sampleseed(GENESEED)
dg2
Source: local data frame [6 x 2]
Groups: GENESEED

   Gene GENESEED
1 ARPC4        1
2 CIDEC        1
3 CAMK1        1
4 CIDEC        2
5 ARPC4        2
6 CAMK1        2
However, when I repeat the above code, I cannot replicate my results.
dg2
Source: local data frame [6 x 2]
Groups: GENESEED

   Gene GENESEED
1 ARPC4        1
2 CAMK1        1
3 CIDEC        1
4 CAMK1        2
5 ARPC4        2
6 CIDEC        2
The problem here is that the dollar sign will not substitute the parameter you are passing. See this minimal example:
df <- data.frame(x = "x", GENESEED = "GENESEED")
h <- function(df, x){
  df$x
}
h(df, GENESEED)
[1] x
Levels: x
See that h returns x even though you asked for GENESEED. So your function is actually trying to get df$seed., which does not exist, so it returns NULL.
But there is another problem. Even correcting this and passing the seed directly, it would still not work as you want because, if you look at the code of sample_frac, dplyr will eventually run the following line:
sampled <- lapply(index, sample_group, frac = TRUE, tbl = tbl,
                  size = size, replace = replace, weight = weight, .env = .env)
Notice that it runs a lapply after you set the seed, so you will not have defined a different seed for each group according to GENESEED as you wanted.
Taking this into consideration, I came up with this solution, using sample.int and do:
ss_sampleseed <- function(x){
  set.seed(unique(x$GENESEED))
  x[sample.int(nrow(x)), ]
}
dg %>% group_by(GENESEED) %>% do(ss_sampleseed(.))
This seems to be working as you want.
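A quick reproducibility check (a sketch using the objects above): because the seed is now set per group inside do(), running the pipeline twice should return identical results.
res1 <- dg %>% group_by(GENESEED) %>% do(ss_sampleseed(.))
res2 <- dg %>% group_by(GENESEED) %>% do(ss_sampleseed(.))
identical(res1, res2)
# [1] TRUE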
I think the main thing going on here is the use of $ inside your function. I certainly had to learn this the hard way. See also:
library(fortunes)
fortune(312)
fortune(343)
Take the simple function from @Carlos Cinelli and try to use it outside of any dplyr functions.
h = function(df, seed.){
  df$seed.
}
h(dg, GENESEED)
NULL
It's those darn dollar signs. Now change the function to use [[ instead.
h2 = function(df, seed.){
  df[[seed.]]
}
h2(dg, "GENESEED")
[1] 1 1 1 2 2 2
That's more like it, although you did have to put quotes around the variable name in the function.
So where does that leave your original function? You can go two ways. First, you could just change to [[ and use quotes around the variable name in your function.
ss_sampleseed = function(df, seed.){
  set.seed(df[[seed.]])
  sample_frac(df, 1)
}
dg %>%
  group_by(GENESEED) %>%
  ss_sampleseed("GENESEED")
Source: local data frame [6 x 2]
Groups: GENESEED

   Gene GENESEED
1 CAMK1        1
2 CIDEC        1
3 ARPC4        1
4 CIDEC        2
5 CAMK1        2
6 ARPC4        2
The other option is to use deparse(substitute(seed.)) inside your function to allow for non-standard evaluation. You'll still need [[, though.
ss_sampleseed2 = function(df, seed.){
  set.seed(df[[deparse(substitute(seed.))]])
  sample_frac(df, 1)
}
dg %>%
  group_by(GENESEED) %>%
  ss_sampleseed2(GENESEED)
Source: local data frame [6 x 2]
Groups: GENESEED

   Gene GENESEED
1 CAMK1        1
2 CIDEC        1
3 ARPC4        1
4 CIDEC        2
5 CAMK1        2
6 ARPC4        2
I get replicated results with either of these, although I didn't check if the seed is specifically set to what you want it to be.

Resources