How can I use dplyr/magrittr's pipe inside functions in R?

I'm trying to write a function that takes a data frame and the name of a column as arguments. When I write the function with standard R syntax, I get the correct result using eval and substitute, as recommended by @hadley in http://adv-r.had.co.nz/Computing-on-the-language.html
> df <- data.frame(y = 1:10)
> f <- function(data, x) {
+ out <- mean(eval(expr = substitute(x), envir = data))
+ return(out)
+ }
> f(data = df, x = y)
[1] 5.5
Now, when I try to write the same function using the %>% operator, it doesn't work:
> df <- data.frame(y = 1:10)
> f <- function(data, x) {
+ data %>%
+ eval(expr = substitute(x), envir = .) %>%
+ mean()
+ }
> f(data = df, x = y)
Error in eval(expr, envir, enclos) : object 'y' not found
>
How can I combine the pipe operator with eval and substitute? It seems really tricky to me.

A workaround would be
f <- function(data, x) {
v <- substitute(x)
data %>%
eval(expr = v, envir = .) %>%
mean()
}
The problem is that the pipe operator (%>%) creates another level of closure, which interferes with the evaluation of substitute(x). You can see the difference with this example:
df <- data.frame(y = 1:10)
f1 <- function(data, x) {
print(environment())
eval(expr = environment(), envir = data)
}
f2 <- function(data, x) {
print(environment())
data %>%
eval(expr = environment(), envir = .)
}
f1(data = df, x = y)
# <environment: 0x0000000006388638>
# <environment: 0x0000000006388638>
f2(data = df, x = y)
# <environment: 0x000000000638a4a8>
# <environment: 0x0000000005f91ae0>
Notice how the environments differ in the magrittr version. You want to take care of the substitute() step as early as possible when working with non-standard evaluation.
I hope your use case is a bit more complex than your example, because it seems like
mean(df$y)
would be a much easier bit of code to read.
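The advice to handle substitute() as early as possible can be sketched in base R without any pipe. The names col_mean, compute, wrong and inner below are hypothetical; the point is only that substitute() inspects the arguments of the function it is called in, so it has to run in the frame that actually received x.

```r
df <- data.frame(y = 1:10)

# Hypothetical helper: capture the expression in the frame that received `x`.
col_mean <- function(data, x) {
  expr <- substitute(x)                       # the caller's `y`, captured here
  compute <- function(d) mean(eval(expr, envir = d))
  compute(data)                               # nested helpers can now use `expr`
}
col_mean(df, y)
#> [1] 5.5

# Calling substitute() one level down finds no argument named `x` there,
# so the symbol comes back unchanged instead of the caller's `y`.
wrong <- function(data, x) {
  inner <- function(d) substitute(x)
  inner(data)
}
wrong(df, y)
#> x
```

Once the expression is stored in a local variable, it no longer matters how many function levels (piped or not) sit between the capture and the eval() call.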

I've been trying to understand my problem.
First, I wrote what I want with the summarise_() function:
> library(dplyr)
> df <- data.frame(y = 1:10)
> summarise_(.data = df, mean = ~mean(y))
mean
1 5.5
Then I try to program my own function. I've found a solution which seems to work with the lazyeval package in this post. I use the lazy() and the interp() functions to write what I want.
The first possibility is:
> library(lazyeval)
> f <- function(data, col) {
+ col <- lazy(col)
+ inter <- interp(~mean(x), x = col)
+ summarise_(.data = data, mean = inter)
+ }
> f(data = df, col = y)
mean
1 5.5
I can also use pipes:
> f <- function(data, col) {
+ col <- lazy(col)
+ inter <- interp(~mean(x), x = col)
+ data %>%
+ summarise_(.data = ., mean = inter)
+ }
>
> f(data = df, col = y)
mean
1 5.5

I would not use eval and substitute.
What follows is a simplified version of this great post suited to your question.
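library(dplyr)
df <- data.frame(y = 1:10)
f <- function(data, x) {
  x <- enquo(x)  # capture the column expression as a quosure
  data %>% summarise(mean = mean(!!x))  # use the data argument, not the global df
}
f(data = df, x = y)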
There are two things happening here:
Transforming the column name with enquo()
Prefixing the column with !!
Please refer to the link for a more complicated example.
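A shorter spelling of the same two steps, assuming rlang 0.4.0 or later (where the embrace operator {{ }} is available through dplyr), combines the enquo() and !! steps into one. The function name col_mean is hypothetical.

```r
library(dplyr)

# Hypothetical helper: {{ x }} captures the argument and unquotes it in one step.
col_mean <- function(data, x) {
  data %>% summarise(mean = mean({{ x }}))
}

df <- data.frame(y = 1:10)
col_mean(df, y)
#>   mean
#> 1  5.5
```

This is only a sketch of the embrace form; the enquo()/!! version above stays useful when the captured expression needs to be inspected or modified before use.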

Related

Function within a function: cramer function "Error in loglin"

I tried to embed the cramer function from sjstats package. Although the function works perfectly outside the custom function, it doesn't work within it.
Thank you very much in advance.
library (sjstats)
cramer2 <- function(dta, x, y){
effsize <- cramer(x ~ y, data = dta)
return(effsize)
}
cramer2(x=gender, y=age, dta=df)
Error in loglin(data, margins, start = start, fit = fitted, param = param, :
incorrect specification of 'table' or 'start'
This happens because x and y are not automatically substituted in a formula for the variables you have passed. Look:
f <- function(x, y) {
return(x ~ y)
}
f(a, b)
#> x ~ y
If you want the variables substituted, you can do something like
f2 <- function(x, y) {
call("~", substitute(x), substitute(y))
}
f2(a, b)
#> a ~ b
So in your case you can do:
library (sjstats)
cramer2 <- function(dta, x, y) {
f <- as.formula(call("~", substitute(x), substitute(y)))
effsize <- cramer(f, data = dta)
return(effsize)
}
Obviously we don't have your data, but using the built-in data set efc we can demonstrate this works as expected:
data(efc)
cramer2(efc, e16sex, c161sex)
#> [1] 0.05258249
Created on 2022-02-27 by the reprex package (v2.0.1)
The solution provided by Allan works perfectly well if your function takes the variables as bare names, i.e. customfunction(dta=mydata, x= gender, y=age, weight=dataweight).
If, however, you must for some reason pass the variables as quoted strings, e.g. customfunction(dta=mydata, x= "gender", y="age", weight="dataweight"),
then replace substitute with sym:
library(sjstats)
library(rlang)  # for sym()
cramer2 <- function(dta, x, y, weight) {
  # x, y and weight are column names given as strings
  f <- as.formula(call("~", sym(x), sym(y)))
  effsize <- cramer(f, weights = dta[[weight]], data = dta)
  return(effsize)
}
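When the column names arrive as strings, base R's reformulate() can build the formula without rlang. The sketch below uses lm() on the built-in mtcars data as a stand-in for cramer(), so that it runs without sjstats; fit_by_name is a hypothetical name.

```r
# Hypothetical helper: x and y are column names given as strings.
fit_by_name <- function(dta, x, y) {
  f <- reformulate(x, response = y)   # builds y ~ x from character input
  lm(f, data = dta)
}

fit <- fit_by_name(mtcars, x = "wt", y = "mpg")
deparse(formula(fit))
#> [1] "mpg ~ wt"
```

The same formula object could be handed to any function that accepts a formula and a data argument.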

Subsetting in a second level R function

Function foo1 can subset a list by a requested variable (e.g., by = type == 1). Otherwise, foo1 will simply output the inputted list itself.
For my purposes, I need to use foo1 within a new function called foo2.
In my code below, my desired output is obtained like so: foo2(data = D, by = G[[1]]) ; foo2(data = D, by = G[[2]]) ; foo2(data = D, by = G[[3]]).
But, I wonder why when I loop over G using lapply, I get an error as shown below?
foo1 <- function(data, by){
L <- split(data, data$study.name) ; L[[1]] <- NULL
if(!missing(by)){
L <- lapply(L, function(x) do.call("subset", list(x, by)))
}
return(L)
}
foo2 <- function(data, by){
eval(substitute(foo1(data = data, by = by)))
}
## EXAMPLE OF USE:
D <- read.csv("https://raw.githubusercontent.com/izeh/i/master/k.csv", h = T) ## Data
G <- lapply(unique(na.omit(D$type)), function(i) bquote(type == .(i)))# all levels of `type`
foo2(data = D, by = G[[1]]) # Works fine without `lapply` :-)
lapply(1:3, function(i) foo2(data = D, by = G[[i]])) # Doesn't work with `lapply`! :-(
# Error in do.call("subset", list(x, by)) : object 'i' not found
Your foo2 function tries to evaluate the expression
foo1(data = D, by = G[[i]])
but it doesn't have i available. You need to evaluate G[[i]] in the anonymous function you're passing to lapply to get an expression defining the subset, and then evaluate that subset in foo2. I recommend naming that function instead of using an anonymous one; it makes debugging a lot easier.
Here's some recoding that appears to work:
Redefine foo2 to
foo2 <- function(data, by){
by <- eval(by, envir = data)
foo1(data = data, by = by)
}
and
foo3 <- function(i) {
expr <- G[[i]]
foo2(data = D, by = expr)
}
and then
lapply(1:3, foo3)
I'm not sure this does exactly what you want, but it should be close enough that you can fix it up.
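The underlying idea, resolving G[[i]] in a place where i is visible and only then evaluating the resulting condition against the data, can be sketched with toy data. The data frame and the names filter_by and pick are hypothetical and do not reproduce foo1 or foo2.

```r
# Toy data standing in for the CSV used in the question.
D <- data.frame(type = c(1, 1, 2, 3), value = 1:4)

# One unevaluated condition per level of `type`, e.g. type == 1.
G <- lapply(unique(D$type), function(k) bquote(type == .(k)))

# Hypothetical helper: `cond` is an already-built call such as quote(type == 1).
filter_by <- function(data, cond) {
  keep <- eval(cond, envir = data)   # evaluate the condition inside the data
  data[keep, , drop = FALSE]
}

# G[[i]] is looked up in this function, where `i` exists.
pick <- function(i) filter_by(D, G[[i]])

res <- lapply(seq_along(G), pick)
sapply(res, nrow)
#> [1] 2 1 1
```

Because filter_by() receives a finished call rather than the text G[[i]], nothing downstream ever needs to find i.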
Instead of lapply, a for loop can be used here:
lst1 <- vector("list", length(G))
for(i in 1:3) lst1[[i]] <- foo2(data = D, by = G[[i]])
Checking:
identical(lst1[[2]], foo2(data = D, by = G[[2]]))
#[1] TRUE
identical(lst1[[3]], foo2(data = D, by = G[[3]]))
#[1] TRUE
For the lapply part, there seems to be a conflict between the i of the anonymous function and the i used in the anonymous function that builds G. If we use a new variable, say 'j',
lst2 <- lapply(1:3, function(j) foo1(data = D, by = G[[j]]))
it should work:
identical(lst2[[2]], lst1[[2]])
#[1] TRUE

Error in if ((dimension < 1) | (dimension > n)) stop("wrong embedding dimension") : argument is of length zero

My code is as follows:
unemp <- c(1:10)
bsp_li <- list(c(1:10),c(11:20),c(21:30))
var_data_rep <- lapply(bsp_li, function(x) {cbind(as.numeric(x), as.numeric(unemp))} ) # Create VAR data matrices
var_data_rep2 <- lapply(var_data_rep, function(x) {colnames(x) = c("rGDP", "U"); return(x)}) # Name columns
var_data_rep_ts <- lapply(var_data_rep2, function(x) {ts(x, frequency=1, start=c(1977))} ) # Make it ts again
var_data_rep_lag <- lapply(var_data_rep_ts, function(x) {VARselect(x, lag.max = 5, type = "const")} ) # Take lag with lowest SC criteria (VAR.pdf)
VARgdp_rep <- lapply(var_data_rep_ts, function(x) {VAR(x, p = var_data_rep_lag$x$selection[['SC(n)']], type = "const"); return(x)} ) # Lag=lowest SC criteria from var_data_rep_lag
If I run only the last line, R always gives me this error:
Error in if ((dimension < 1) | (dimension > n)) stop("wrong embedding dimension") :
argument is of length zero
Called from: embed(y, dimension = p + 1)
But if I run it with Source, it seems to work. Any suggestions?
This seems to work (at least no error is thrown) :
VARgdp_rep <-
  lapply(seq_along(var_data_rep_ts),  # iterate over positions, not elements
         function(i) {
           res <- VAR(var_data_rep_ts[[i]],
                      p = var_data_rep_lag[[i]]$selection[['SC(n)']],
                      type = "const")
           return(res)
         }
  )
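In your code, return(x) is strange: after running VAR, you just return the x that was passed to the function, so the fitted model is discarded.
And $x has no meaning here: var_data_rep_lag$x looks for a list element literally named "x", which does not exist, so p ends up NULL (length zero) and triggers the error.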
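The length-zero problem can be reproduced in base R with toy stand-ins. The lists lags and series below are hypothetical placeholders rather than real VARselect() output, and embed() is used directly because the traceback in the question shows the error surfacing there.

```r
# Toy stand-ins for the lag selections and the series (not real VAR output).
lags   <- list(list(selection = c("SC(n)" = 1)),
               list(selection = c("SC(n)" = 2)))
series <- list(1:10, 11:20)

# `$x` looks for an element literally named "x"; there is none, so p is NULL.
p <- lags$x$selection[["SC(n)"]]
is.null(p)
#> [1] TRUE

# NULL + 1 has length zero, which is what makes the if() inside embed() fail.
failed <- inherits(try(embed(series[[1]], dimension = p + 1), silent = TRUE),
                   "try-error")
failed
#> [1] TRUE

# Pairing each series with its own lag avoids the lookup by name entirely.
fits <- Map(function(s, l) embed(s, dimension = l$selection[["SC(n)"]] + 1),
            series, lags)
sapply(fits, ncol)
#> [1] 2 3
```

Map() walks both lists in parallel, which is an alternative to indexing by position with seq_along().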

Formula evaluation with mutate()

Is there a way to make mutate() evaluate formulas in (d)plyr package of R? I think of situations where one has many variables like count.a, count.b, ..., count.z and I would like to create a new variable to sum all these. I can create a character string like "count.total = count.a + count.b + (...) + count.z", but how to make mutate() evaluate it?
If you want expression input
library(dplyr)
df = data.frame(x = 1:10, y = 2:11)
f = function(df, s){
eval(substitute(mutate(df, z = s)))
}
f(df, x-y)
f(df, x+y)
If you want character input
g = function(df, s){
q = quote(mutate(df, z = s))
eval(parse(text=sub("s", s, deparse(q))))
}
g(df, "x-y")
g(df, "x+y")
You can also modify the functions to take the name of z as an input.
Expression input: f1 passes all extra parameters to mutate, f2 only passes one argument to mutate.
f1 = function(df, ...){
mutate(df, ...)
}
f1(df, a = x-y)
f2 = function(df, ...){
dots = substitute(alist(...))
var = names(dots)[2]
cal = as.call(list(quote(mutate), quote(df)))
cal[var] = dots[2]
eval(cal)
}
f2(df, a = x-y)
Again, if you want to use character input
g1 = function(df, s){
q = quote(mutate(df, z = s))
eval(parse(text=sub("z = s", s, deparse(q))))
}
g1(df, "a=x-y")
g1(df, "w=x+y")
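For the specific case in the question (summing many columns named count.a, count.b, and so on), a base R sketch may be enough. The toy data frame below is hypothetical; the second variant shows how a character string can be parsed and evaluated inside the data frame without mutate().

```r
# Toy data with the naming pattern described in the question.
df <- data.frame(count.a = 1:3, count.b = 4:6, count.c = 7:9,
                 id = c("p", "q", "r"))

# Select the count.* columns by name.
count_cols <- grep("^count\\.", names(df), value = TRUE)

# Variant 1: no string building at all.
df$count.total <- rowSums(df[count_cols])

# Variant 2: build "count.a + count.b + count.c", parse it, evaluate it in df.
expr <- parse(text = paste(count_cols, collapse = " + "))[[1]]
df$count.total2 <- eval(expr, envir = df)

df$count.total
#> [1] 12 15 18
```

Variant 1 avoids parsing text entirely, which tends to be easier to debug when column names contain unusual characters.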

Error in terms.formula(formula) : '.' in formula and no 'data' argument

I'm trying to use neuralnet for prediction.
Create some X:
x <- cbind(seq(1, 50, 1), seq(51, 100, 1))
Create Y:
y <- x[,1]*x[,2]
Give them names:
colnames(x) <- c('x1', 'x2')
names(y) <- 'y'
Make data.frame:
dt <- data.frame(x, y)
And now I get an error:
model <- neuralnet(y~., dt, hidden=10, threshold=0.01)
Error in terms.formula(formula) : '.' in formula and no 'data' argument
By contrast, the same formula works in lm (linear model).
As my comment states, this looks like a bug in the non-exported function neuralnet:::generate.initial.variables. As a workaround, just build a long formula from the names of dt, excluding y, e.g.
n <- names(dt)
f <- as.formula(paste("y ~", paste(n[!n %in% "y"], collapse = " + ")))
f
## gives
> f
y ~ x1 + x2
## fit model using `f`
model <- neuralnet(f, data = dt, hidden=10, threshold=0.01)
> model
Call: neuralnet(formula = f, data = dt, hidden = 10, threshold = 0.01)
1 repetition was calculated.
        Error Reached Threshold Steps
1 53975276.25     0.00857558698  1967
Offering a simpler alternative to the previous answer, you can create a formula from names of dt using reformulate():
f <- reformulate(setdiff(colnames(dt), "y"), response="y")
reformulate() doesn't require the use of paste() and automatically adds the terms together.
To expand a formula
f <- formula(terms(f, data= dt))
or even shorter
f <- formula(dt, f)
where f is the formula and dt is the data.
For instance, the original formula could be:
f <- as.formula("y ~ .")
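The formula-building routes above can be compared side by side in base R, without neuralnet installed. The small data frame is a hypothetical stand-in for dt; only paste(), reformulate() and terms() are used.

```r
# Small stand-in for the data frame in the question.
dt <- data.frame(x1 = 1:5, x2 = 6:10)
dt$y <- dt$x1 * dt$x2

predictors <- setdiff(names(dt), "y")

# Route 1: paste the right-hand side together by hand.
f_paste <- as.formula(paste("y ~", paste(predictors, collapse = " + ")))

# Route 2: let reformulate() join the terms.
f_ref <- reformulate(predictors, response = "y")

# Route 3: expand the dot using the data.
f_dot <- formula(terms(y ~ ., data = dt))

sapply(list(f_paste, f_ref, f_dot), deparse)
#> [1] "y ~ x1 + x2" "y ~ x1 + x2" "y ~ x1 + x2"
```

All three produce an explicit formula with no dot, which is the form the workaround passes to neuralnet().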

Resources