pass arguments when pipe is called inside a function - r

I wanted to write a function that is the composition of two functions using the pipe operator %>%, so that, for instance, the following are equivalent (imagine I call it %O%):
iris[1:4] %>% lapply(FUN = function(x) hist(sample(x)))
iris[1:4] %>% lapply(FUN = sample %O% hist)
I want it in this direction (not hist %O% sample) because it would follow the same logic as %>%.
I've come up with something like this:
'%O%' <- function(lhs1, rhs1){
  return(
    function(x){
      return(x %>% lhs1 %>% rhs1)
    })
}
However, it raises errors when I try
iris[1:4] %>% lapply(FUN = sample(size = 100, replace = TRUE) %O% hist)
What should I do to allow the parameters to be understood by %>% inside the %O% function? Is it an eval and quote problem? I also don't really understand how %>% is able to read arguments in lhs and rhs.
Any help will be appreciated. Thanks a lot!

Use a magrittr functional sequence instead?
f <- . %>% sample() %>% hist()
iris[1:4] %>% lapply(f)
Or just
iris[1:4] %>% lapply(. %>% sample() %>% hist())
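As far as I can tell, the original call fails because sample(size = 100, replace = TRUE) is an ordinary call: R tries to evaluate it, without any data, as soon as lhs1 is used inside the returned function. A functional sequence avoids this because each step is stored unevaluated, with . standing in for the data. The sketch below assumes magrittr is installed. The names resample_mean and resample are illustrative, and mean() replaces hist() only so that no plot device is opened.

```r
library(magrittr)

# A functional sequence: starting the pipeline with `.` makes magrittr
# build a function instead of evaluating the steps immediately.
resample_mean <- . %>% sample(size = 100, replace = TRUE) %>% mean()

set.seed(42)
res <- iris[1:4] %>% lapply(resample_mean)

# Plain composition also works if the extra arguments are wrapped in a
# function first, so that %O% only ever receives functions.
`%O%` <- function(f, g) function(x) g(f(x))
resample <- function(x) sample(x, size = 100, replace = TRUE)
res2 <- iris[1:4] %>% lapply(resample %O% mean)
```

Both res and res2 are lists with one resampled mean per column. The second form keeps the left-to-right reading order asked for in the question.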

Here are some possibilities:
iris[1:4] %>% lapply(FUN = function(x) x %>% sample %>% hist)
library(gsubfn)
iris[1:4] %>% fn$lapply(FUN = x ~ x %>% sample %>% hist)
library(functional)
iris[1:4] %>% lapply(FUN = Compose(sample, hist))
library(functional)
`%c%` <- Compose
iris[1:4] %>% lapply(FUN = sample %c% hist)

Related

Strange(?) behaviour of `summary()` inside a pipe

A couple of colleagues and I are wondering why
N <- rnorm(16) %>% matrix(., ncol = 4) %>% `colnames<-`(letters[1:4]) %T>% summary()
or
N <- rnorm(16) %>% matrix(., ncol = 4) %>% `colnames<-`(letters[1:4]) %T>% summary() %>% `+`(., 0)
do not work (summary is not printed), while
N <- rnorm(16) %>% matrix(., ncol = 4) %>% `colnames<-`(letters[1:4]) %T>% {print(summary(.))}
does?
There's no reason for summary to print its output any more than rnorm or matrix or any of the other functions in the pipeline. Printing is generally suppressed when there is assignment <-.
N <- rnorm(16) %>% matrix(., ncol = 4) %>% `colnames<-`(letters[1:4]) %T>% summary()
# no printing
N <- 5
# no printing
M <- rnorm(16) %>% matrix(., ncol = 4) %>% `colnames<-`(letters[1:4])
# no printing
M %>% summary
summary(M)
# prints
sM <- M %>% summary
sM <- summary(M)
# no printing
# assignment prevents implicit printing.
An explicit print call creates a side-effect (printing), which is what you want to happen.
The tee pipe %T>% doesn't create any side effects (like printing) itself; it just returns the LHS. So if you %T>% into a function that creates side effects but does not return its input (like plot()), you get those side effects while still piping the input to the next step.
Note that print does return its argument (invisibly), so you don't actually need the tee pipe %T>% with print.
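The points above can be checked with a small sketch, assuming magrittr is installed. The object names M, N, P, Q and vis are only for illustration.

```r
library(magrittr)

set.seed(1)
M <- rnorm(16) %>% matrix(ncol = 4) %>% `colnames<-`(letters[1:4])

# The tee pipe runs the braces for their side effect (printing the summary)
# and then passes M itself along, so N ends up identical to M.
N <- M %T>% {print(summary(.))}

# print() returns its argument invisibly, so a plain pipe is enough here:
# the matrix is printed and still assigned to P.
P <- M %>% print()

# An assignment returns its value invisibly, which is why nothing is
# auto-printed at top level when the result is assigned.
vis <- withVisible(Q <- summary(M))$visible
```

Here vis is FALSE, while N and P both hold the original matrix.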

Error with tidy select when feeding column names into purrr::map for user function

I have a long function that uses a dataframe column name as an input and am trying to apply it to several different column names without a new line of code each time. I am having issues with tidyselect within the function called by map. I believe the issue is related to defusing, but I cannot figure it out. A toy example using mtcars data is below.
This works correctly with map:
library(tidyverse)
sum_dplyr <- function(df, x) {
res <- df %>% summarise(mean = mean({{x}}, na.rm = TRUE))
return(res)
}
sum_dplyr(mtcars, disp)
map(names(mtcars), ~ sum_dplyr(mtcars, mtcars[[.]])) # all columns -> works fine
While this gives the error "Must subset columns with a valid subscript vector" when feeding the function through map:
library(tidyverse)
sel_dplyr <- function(df, x) {
res <- df %>% dplyr::select({{x}})
return(res)
}
sel_dplyr(mtcars, disp) # ok
map(names(mtcars), ~ sel_dplyr(mtcars, mtcars[[.]])) # all columns -> error
What am I missing here? Many thanks!
It may be better to change the function so that it accepts both unquoted and quoted column names. With map, we are passing a character string, so instead of {{}} we can use ensym with !!:
sum_dplyr <- function(df, x) {
x <- rlang::ensym(x)
res <- df %>%
summarise(mean = mean(!!x, na.rm = TRUE))
return(res)
}
Similarly for sel_dplyr
sel_dplyr <- function(df, x) {
x <- rlang::ensym(x)
res <- df %>%
dplyr::select(!! x)
return(res)
}
and then test as
library(purrr)
library(dplyr)
map(names(mtcars), ~ sel_dplyr(mtcars, !!.x))
sel_dplyr(mtcars, carb)
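If the functions only ever need to receive column names as strings, another option is to skip defusing entirely. This is a minimal sketch, assuming dplyr and purrr are installed; sel_chr and sum_chr are hypothetical names, not part of the original answer. It uses all_of() for selection and the .data pronoun for data masking. In the question, mtcars[[.]] passes the column's values rather than its name, which would explain why select() complains about an invalid subscript.

```r
library(dplyr)
library(purrr)

# Select a column given its name as a character string.
sel_chr <- function(df, col) {
  df %>% select(all_of(col))
}

# Summarise a column given its name as a character string.
sum_chr <- function(df, col) {
  df %>% summarise(mean = mean(.data[[col]], na.rm = TRUE))
}

sels  <- map(names(mtcars), ~ sel_chr(mtcars, .x))
means <- map(names(mtcars), ~ sum_chr(mtcars, .x))
```

The trade-off is that these versions no longer accept bare column names such as disp; the ensym approach above handles both.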

FUN == 'x' does not work, how to go around it in R

I am trying to write a function that takes another function, FUN, as an argument. Among other things, I want it to behave differently when I set FUN = match0.
library(dplyr)
library(purrr)
f <- function(df, pair, FUN, ...){
  df1 <- df %>%
    group_split()
  w <- df1 %>%
    map(~ .x %>%
          nrow() %>%
          seq())
  x <- map2(w, df1, ~map(.x, mean, df = .y))
  y <- map(x, unlist)
  l <- map2(y, df1, ~map(.x, function(x, df = .y){
    if(deparse(substitute(FUN)) == 'match0'){
      out <- x
    } else{
      out <- df[x, pair]}
    return(out)
  }
  )) %>% unlist()
  df <- bind_rows(df1) %>% bind_cols(index = l)
  return(df)
}
If I run, for instance:
a <- data.frame(n = c(15,20,15,20,15,20)) %>% group_by(n)
x <- f(a, pair = 'pairs0', FUN = match0)
I get Column 'pairs0' doesn't exist.
This would be the case if, in fact, the conditional statement evaluated to FALSE. How can I change this?
To be honest, I'm not quite sure how to use deparse, substitute and the like; I've just tried to follow some other posts. FWIW, I thought it would work because if I test deparse(substitute(match0)) == 'match0', I get TRUE.
Any help?
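One likely cause, sketched below in base R: substitute() only replaces symbols that are arguments or local variables of the function it is called in. Inside the nested anonymous function, FUN is neither, so substitute(FUN) returns the symbol FUN itself and the comparison with 'match0' is FALSE. Capturing the name once at the top of the outer function avoids this. The names outer_wrong, outer_right and fun_name are hypothetical, and match0 is a stand-in since its definition is not shown in the question.

```r
# Stand-in for the real match0, which is not shown in the question.
match0 <- function(...) NULL

# substitute() is called inside the inner function, where FUN is not an
# argument, so the symbol `FUN` comes back unchanged.
outer_wrong <- function(FUN) {
  inner <- function() deparse(substitute(FUN))
  inner()
}

# Capture the argument's expression in the function that receives it,
# then reuse the resulting string inside nested functions.
outer_right <- function(FUN) {
  fun_name <- deparse(substitute(FUN))
  inner <- function() fun_name
  inner()
}

wrong <- outer_wrong(match0)
right <- outer_right(match0)
```

In f, this would mean computing the name once before the map2 call and testing that stored string inside the inner function.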

Analysis by row with multiple functions in dplyr

I'm trying to analyze the cases (rows) of a data.frame with dplyr, but to no avail. I created two functions for this:
f1 <- function(x) {
c(s = sum(x),
m = mean(x),
v = var(x))
}
f2 <- function(x) {
apply(x, 1, f1)
}
My data.frame (data_1):
for (i in 1:6) {
assign(paste('var', i, sep = '_'),
runif(30, 20, 100))
}
data_1 <- do.call(
cbind.data.frame,
mget(ls(pattern = '^var_'))
)
Using dplyr functions:
library(dplyr)
data_1 %>%
mutate_at(.vars = vars (starts_with('v')),
.funs = funs(.= f2))
data_1 %>%
mutate_if(is.numeric, .funs = funs(.= f2))
Error in mutate_impl(.data, dots) : Evaluation error: dim(X) must have a positive length.
Since the analysis is done in the rows, and I have three functions (sum, mean, and variance), the expected return is three columns.
Although not deprecated, rowwise() does not play well with other grouping and summary functions, so it is best avoided in dplyr. A useful alternative is to group by row number. Here is a solution to the above using this approach.
colNames <- syms(paste0("var_", 1:6))
data_1 %>%
  group_by(row_number()) %>%
  # wrap the spliced columns in c() so mean() receives a single vector
  summarize(dataMean = mean(c(!!!colNames)),
            dataSum = sum(!!!colNames))
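For comparison, the three-column result the question expects can also be sketched in base R by reusing f1 with apply() and transposing the result. This is an alternative outside dplyr, not the answer's method. The data is regenerated here with a fixed seed so the example is self-contained, and row_stats and result are illustrative names.

```r
set.seed(1)

# Rebuild example data shaped like data_1: six numeric columns, 30 rows.
data_1 <- as.data.frame(replicate(6, runif(30, 20, 100)))
names(data_1) <- paste0("var_", 1:6)

f1 <- function(x) {
  c(s = sum(x),
    m = mean(x),
    v = var(x))
}

# apply() over rows returns one column per row, so transpose to get
# one row per case and one column per statistic.
row_stats <- t(apply(data_1, 1, f1))
result <- cbind(data_1, row_stats)
```

The result keeps the six original columns and adds s, m and v.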

Discretize variables using SparkR

I want to discretize a variable using R, preferably SparkR, so that the desired results would be like the following.
library(arules)
mtcars %>% mutate(bins = discretize(x = mpg, method = "interval", breaks = 4))
I checked the documentation but could find only non-R solutions, at https://spark.apache.org/docs/2.2.0/ml-features.html#bucketizer.
Please advise.
In general, SparkR provides a very limited subset of ML functions (full support is planned for Spark 3.0 as a separate R package; see SPARK-24359, SPIP: ML Pipelines in R), though simple discretization like this can be performed using CASE ... WHEN ... statements.
First compute the breaks:
df <- createDataFrame(mtcars)
min_max <- df %>%
select(min(df$mpg), max(df$mpg)) %>%
collect() %>%
unlist()
n <- 4
breaks <- seq(min_max[[1]], min_max[[2]], length.out = n)
Then generate the expression:
bucket <- purrr::map2(
    breaks[-n], breaks[-1],
    function(x, y) between(column("mpg"), c(x, y))) %>%
  purrr::reduce2(
    ., seq(length(.)),
    function(acc, x, y) otherwise(when(x, y), acc),
    .init = lit(NA))
df %>% withColumn("bucket", bucket)
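To sanity-check the Spark result on a small dataset, the same equal-width binning can be sketched locally in base R with cut(). This is only a local reference, not SparkR code. Note that it uses n + 1 break points to obtain n bins, whereas seq(..., length.out = n) as written above yields n - 1 intervals; the variable names here are illustrative.

```r
n <- 4
mpg <- mtcars$mpg

# n equal-width bins need n + 1 break points.
breaks <- seq(min(mpg), max(mpg), length.out = n + 1)

# include.lowest = TRUE keeps the minimum value inside the first bin.
bins <- cut(mpg, breaks = breaks, include.lowest = TRUE)
counts <- table(bins)
```

Every row of mtcars falls into exactly one of the four bins.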

Resources