Difference between lazy and substitute in R

I'm trying to use the lazyeval package for non-standard evaluation in R, but I'm confused about the difference between substitute and lazy.
df <- data.frame(col1 = runif(10), col2 = runif(10))
> df
col1 col2
1 0.54959138 0.8926778
2 0.99857207 0.9649592
3 0.26451336 0.9243096
4 0.98755113 0.7155882
5 0.84257525 0.5918387
6 0.20692997 0.5875944
7 0.44383744 0.5839235
8 0.44014903 0.1006080
9 0.49835993 0.7637619
10 0.07162048 0.3155483
I first created a function to take a data frame and two column names and return a column that is the sum of the two columns. substitute and eval seem to work just fine.
SubSum <- function(data, x, y) {
exp <- substitute(x+y)
r <- eval(exp, data)
return(cbind(data, data.frame(sum=r)))
}
> SubSum(df, col1, col2)
col1 col2 sum
1 0.54959138 0.8926778 1.4422692
2 0.99857207 0.9649592 1.9635312
3 0.26451336 0.9243096 1.1888229
4 0.98755113 0.7155882 1.7031394
5 0.84257525 0.5918387 1.4344140
6 0.20692997 0.5875944 0.7945244
7 0.44383744 0.5839235 1.0277610
8 0.44014903 0.1006080 0.5407570
9 0.49835993 0.7637619 1.2621218
10 0.07162048 0.3155483 0.3871688
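A minimal base R sketch of why the substitute version works: inside a function, substitute() replaces the argument names x and y with the expressions the caller typed, so the result is the call col1 + col2, which eval() can then resolve against the data frame. The names show_substitute and small_df are illustrative only.

```r
# Hypothetical helper: returns the unevaluated expression built from its arguments
show_substitute <- function(x, y) substitute(x + y)

expr <- show_substitute(col1, col2)
print(expr)  # the call col1 + col2; nothing has been evaluated yet

# Evaluate the captured call, using the data frame's columns as variables
small_df <- data.frame(col1 = c(1, 2), col2 = c(10, 20))
total <- eval(expr, small_df)
print(total)  # 11 22
```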
I then tried to create a function with lazy and lazy_eval, but it didn't work.
require(lazyeval)
LazySum <- function(data, x, y) {
exp <- lazy(x+y)
r <- lazy_eval(exp, data)
return(cbind(data, data.frame(sum=r)))
}
> LazySum(df, col1, col2)
Error in eval(expr, envir, enclos) : object 'col1' not found
My current answer
After some trial and error, this snippet seems to work.
LazySum <- function(data, x, y) {
exp <- interp(~x + y, x=lazy(x), y=lazy(y))
r <- lazy_eval(exp, data)
return(cbind(data, data.frame(sum=r)))
}
Basically I had to build the lazy expression myself using interp.
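For comparison, base R can splice captured expressions into a template in a similar spirit, using substitute() on each argument and bquote() to assemble them. This is only an analogue of the interp approach above, not a description of how lazyeval works internally; BaseSum and demo_df are hypothetical names.

```r
# Hypothetical base R counterpart of LazySum
BaseSum <- function(data, x, y) {
  x_expr <- substitute(x)                 # expression typed for x, e.g. col1
  y_expr <- substitute(y)                 # expression typed for y, e.g. col2
  expr <- bquote(.(x_expr) + .(y_expr))   # assemble the call col1 + col2
  # Look names up in the data first, then in the caller's environment
  r <- eval(expr, data, parent.frame())
  cbind(data, data.frame(sum = r))
}

demo_df <- data.frame(col1 = c(1, 2, 3), col2 = c(4, 5, 6))
out <- BaseSum(demo_df, col1, col2)
print(out)
```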

You were pretty close.
Read ?lazy, especially the examples, to understand the changes I made to your code:
require(lazyeval)
set.seed(357)
df <- data.frame(col1 = runif(10), col2 = runif(10))
LazySum <- function(data, sum=x+y) {
exp <- lazy(sum) # giving lazy a named argument
r <- lazy_eval(exp, data)
return(cbind(data, data.frame(sum=r)))
}
LazySum(df, col1+col2)

Related

Using same function on multiple datasets and using specific columns

I have 8 datasets and I want to apply a function that converts any number less than 5 to NA in 3 columns (var1, var2, var3) of each dataset. How can I write a function to do this efficiently? I went through lots of similar questions on Stack Overflow, but I didn't find any answer where specific columns were used. I have written the function to do the replacement but can't figure out how to apply it to all the datasets.
Input:
Data1
variable1 variable2 variable3 variable4
10 36 56 99
15 3 2 56
4 24 1 1
Expected output:
variable1 variable2 variable3 variable4
10 36 56 99
15 NA NA 56
NA 24 NA 1
Perform the same thing for 7 more datasets.
So far I have stored the needed variables and datasets in two different lists.
var1=enquo(variable1)
var2=enquo(variable2)
var3=enquo(variable3)
Total=3
listofdfs=list()
listofdfs_1=list()
for(i in 1:8) {
df=sym((paste0("Data",i)))
listofdfs[[i]]=df
}
for(e in 1:Total) {
listofdfs[[e]]= eval(sym(paste0("var",e)))
}
The selected columns will go through this function:
temp_1=function(x,h) {
h=enquo(h)
for(e in 1:Total) {
if(substr(eval(sym(paste0("var",e))),1,3)=="var") {
y= x %>% mutate_at(vars(!!h), ~ replace(., which(.<=5),NA))
return(y)
}
}
}
I was expecting something like:
lapply(for each dataset's selected columns,temp_1)
Here's a simple approach that should work:
cols_to_edit = paste0("var", 1:3)
result_list = lapply(list_of_dfs, function(x) {
x[cols_to_edit][x[cols_to_edit] < 5] = NA
return(x)
})
I assume your starting data is in a list called list_of_dfs, that the names of columns to edit are the same in all data frames, and that you can construct a character vector cols_to_edit with those names.
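A small worked run of the approach above, under the stated assumptions. The two data frames are made up for illustration (Data1 mirrors the sample input from the question, with the columns named var1 to var4), and list_of_dfs is built by hand.

```r
# Illustrative data; column names var1..var4 are an assumption
Data1 <- data.frame(var1 = c(10, 15, 4), var2 = c(36, 3, 24),
                    var3 = c(56, 2, 1),  var4 = c(99, 56, 1))
Data2 <- data.frame(var1 = c(1, 7), var2 = c(8, 2),
                    var3 = c(5, 6), var4 = c(0, 3))
list_of_dfs <- list(Data1 = Data1, Data2 = Data2)

cols_to_edit <- paste0("var", 1:3)
result_list <- lapply(list_of_dfs, function(x) {
  # Logical matrix indexing: only cells below 5 in the chosen columns become NA
  x[cols_to_edit][x[cols_to_edit] < 5] <- NA
  x
})
print(result_list$Data1)
```

The fourth column is left untouched because it is not listed in cols_to_edit, and a value of exactly 5 is kept because the comparison is strict.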
Here is a solution to the problem in the question.
First of all, create a test data set.
createData <- function(Total = 3){
numcols <- Total + 1
set.seed(1234)
for(i in 1:8){
tmp <- replicate(numcols, sample(10, 20, TRUE))
tmp <- as.data.frame(tmp)
names(tmp) <- paste0("var", seq_len(numcols))
assign(paste0("Data", i), tmp, envir = .GlobalEnv)
}
}
createData()
Now, the data transformation.
This is much easier if the many dataframes are in a "list".
df_list <- mget(ls(pattern = "^Data"))
I will present two solutions, a base R one and a tidyverse one. Note that both use the function temp_1, which is written in base R only.
library(tidyverse)
temp_1 <- function(x, h){
f <- function(v){
is.na(v) <- v <= 5
v
}
x[h] <- lapply(x[h], f)
x
}
h <- grep("var[123]", names(df_list[[1]]), value = TRUE)
df_list1 <- lapply(df_list, temp_1, h)
df_list2 <- df_list %>% map(temp_1, h)
identical(df_list1, df_list2)
#[1] TRUE
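The inner function f above relies on the replacement form of is.na(). A short sketch of that idiom on a plain vector (the vector v here is made up for illustration):

```r
v <- c(10, 3, 5, 8)
# Replacement form of is.na(): marks the selected elements as NA,
# equivalent to v[v <= 5] <- NA
is.na(v) <- v <= 5
print(v)  # 10 NA NA  8
```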

Why are values converted to strings in the rollapply window?

More newbie questions... I am trying to understand why rollapply is turning all my columns to strings. Suppose I have this:
> df <- data.frame(col1=c(1,2,3,4),
col2=c("a","b","c","d"),
col3=c("!","#","#","$"),
stringsAsFactors = F)
> v <- zoo(df, toupper(df$col2))
> v
col1 col2 col3
A 1 a !
B 2 b #
C 3 c #
D 4 d $
And then I run rollapply:
> rollapply(v, 2, by.column = F, function(x) {
+ sum(x[,"col1"])
+ })
Error in sum(x[, "col1"]) : invalid 'type' (character) of argument
Why is col1 now a character, and how do I fix it so that I get a slice of my original zoo object in each window?
I rolled my own rollapply function based on some reading of other posts on SO. It just returns the indexes into the data (i.e. the zoo object):
rollapply.list <- function(data, width, FUN) {
len <- NROW(data)
add <- rep(0:(len-width),each=width)
lst <- rep(1:(width),len-width+1)
seq.list <- split(lst+add, add)
lapply(seq.list, FUN)
}
and then apply the indexes to the original data like:
rollapply.list(data=v, width=2, FUN=function(x) {
slice <- v[x] #slice out indexes from the original zoo object
...
})
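One likely explanation for the character column, offered as a sketch rather than a statement about zoo internals: a zoo object keeps its data in a single vector or matrix, and a matrix can hold only one type, so mixing numeric and character columns coerces everything to character. Base R's as.matrix() shows the same coercion on the data frame from the question. The rolling sum at the end uses plain index windows on the numeric column; it is one possible workaround, not the rollapply API.

```r
df <- data.frame(col1 = c(1, 2, 3, 4),
                 col2 = c("a", "b", "c", "d"),
                 col3 = c("!", "#", "#", "$"),
                 stringsAsFactors = FALSE)

# A matrix has a single type, so mixed columns all become character
mixed <- as.matrix(df)
print(typeof(mixed))         # "character"

# Keeping only the numeric column preserves the numeric type
numeric_only <- as.matrix(df["col1"])
print(typeof(numeric_only))  # "double"

# Rolling sum of width 2 over col1 using index windows
width <- 2
starts <- seq_len(nrow(df) - width + 1)
rolling_sum <- sapply(starts, function(s) sum(df$col1[s:(s + width - 1)]))
print(rolling_sum)           # 3 5 7
```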

Solve iteratively dataframe in R

I am trying to build a data frame (df2) based on the following relationship: df1[i,j] = df2[i,j]^2. To do this, I need to solve a system of non-linear equations:
library(nleqslv)
df1 = data.frame(a = c(9,9), b = c(9,9))
df2 = df1
for(i in colnames(df1)){
f = function(x) {df1[i] - x^2}
xstart = c(df2[i])
df2[i] = nleqslv(xstart, f)[[1]]
}
The expected result is:
a b
1 3 3
2 3 3
But I get the following error message:
Error in nleqslv(xstart, f) :
Argument 'x' cannot be converted to numeric!
I'm not sure what causes the problem. Could you give me some advice, please?
Well, I don't know what you are trying to accomplish, but I think the function you defined has to be fixed. You can do it in the following manner, although the answer is not correct.
f <- function(x) x - x^2
df1 = data.frame(a = c(9,9), b = c(9,9))
sapply(df1, function(y) nleqslv(y, f)[[1]])
You should instead use sqrt() since it is vectorized.
sqrt(df1)
# a b
# 1 3 3
# 2 3 3
I'm unclear as to why you need such a complex solution for such a simple operation (df2 <- sqrt(df1) would produce your example solution). But if you want to know what's producing that error, it comes down to how R indexes lists.
df1[1] returns a one-column data frame (which is a list), whereas df1[[1]] (double brackets) returns the column itself as a vector. The nleqslv function expects vectors. So all we have to do is modify your existing code to use double brackets instead of singles:
library(nleqslv)
df1 = data.frame(a = c(9,9), b = c(9,9))
df2 = df1
for(i in colnames(df1)){
f = function(x) {df1[[i]] - x^2}
xstart = c(df2[[i]])
df2[i] = nleqslv(xstart, f)[[1]]
}
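The difference between single and double brackets can be checked directly. A short sketch using the data frame from the question (the variable names as_frame and as_vector are illustrative):

```r
df1 <- data.frame(a = c(9, 9), b = c(9, 9))

as_frame  <- df1["a"]    # single brackets keep the data frame wrapper
as_vector <- df1[["a"]]  # double brackets extract the column itself

print(class(as_frame))      # "data.frame"
print(is.list(as_frame))    # TRUE
print(class(as_vector))     # "numeric"
print(is.numeric(as_frame)) # FALSE
```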
First creating the data:
df2 <- data.frame(a=c(9,9), b=c(9,9))
df1 <- df2
Now on solving it iteratively, here's the R code:
for(i in 1:nrow(df1)){
for(j in 1:ncol(df1)){
df2[i, j] <- sqrt(df1[i,j])
}
}
df2
This will return:
a b
3 3
3 3
You could have used a vectorized solution (df2 <- sqrt(df1)) to achieve the above as well, but the loop function above will work for you if you need to solve for it iteratively using a traditional loop.
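If the goal is to solve each equation numerically rather than call sqrt() directly, base R's uniroot() can stand in for nleqslv in a cell-by-cell loop. This is a sketch with a swapped-in solver, not the nleqslv approach from the question; the search interval c(0, target + 1) is an assumption that targets are non-negative, and it finds only the non-negative root.

```r
df1 <- data.frame(a = c(9, 9), b = c(9, 9))
df2 <- df1

for (i in seq_len(nrow(df1))) {
  for (j in seq_len(ncol(df1))) {
    target <- df1[i, j]
    # Residual of the relationship df1[i, j] = x^2
    f <- function(x) target - x^2
    # f(0) >= 0 and f(target + 1) < 0, so the interval brackets a root
    df2[i, j] <- uniroot(f, interval = c(0, target + 1), tol = 1e-10)$root
  }
}
print(df2)
```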

R: Function arguments and lapply nested in a function or called from external function with data.table

Still new to data.table and working with environments.
I have a data.table similar to this (although much larger):
mydt <- data.table(ID = c("a", "a", "a", "b", "b", "b"),
col1 = c(1, 2, 3, 4, 5, 6),
col2 = c(7, 8, 9, 10, 11, 12),
key = "ID")
I wrote a function that takes mydt, splits it into a list of data.tables by its key, and then, in each table of that list, multiplies a user-specified column by a user-specified constant:
myfun <- function(data, constant, column) {
data <- split(x = data, by = key(data))
data <- lapply(data, function(i) {
i[ , (column) := get(column)*constant]
})
return(data)
}
x <- myfun(data = mydt, constant = 3, column = "col1")
x
$a
ID col1 col2
1: a 3 7
2: a 6 8
3: a 9 9
$b
ID col1 col2
1: b 12 10
2: b 15 11
3: b 18 12
If I understand correctly the scoping rules in R, lapply will look in the environment it was called in, will find the column and constant provided as arguments to myfun and will use them.
However, the function passed to lapply is much longer and more complex than the one here and it will be used in other functions that do many other things than just splitting the data.table. This is why I would like to define this part as an external function that will be called within other functions. This is what I did:
split.dt <- function(data) {
split(data, by = key(data))
}
mult <- function(data) {
lapply(data, function(i) {
i[ , (column) := get(column)*constant]
})
}
myfun <- function(data, constant, column) {
data <- split.dt(data = data)
data <- mult(data = data)
}
x <- myfun(data = mydt, constant = 3, column = "col1")
An error is returned:
Error in eval(expr, envir, enclos) : object 'column' not found
I tried wrapping column, as in i[ , eval(column)], within the mult function, together with parent.frame() and parent.env(), without any success. In the end I reached a solution where I used sys.call to get the arguments passed to myfun in a list and use them in mult like this:
split.dt <- function(data) {
split(data, by = key(data))
}
mult <- function(data) {
supplied.col <- sys.call(which = -1)[["column"]]
supplied.constant <- sys.call(which = -1)[["constant"]]
lapply(data, function(i) {
i[ , eval(supplied.col) := get(supplied.col)*supplied.constant]
})
}
myfun <- function(data, constant, column) {
data <- split.dt(data = data)
data <- mult(data = data)
}
x <- myfun(data = mydt, constant = 3, column = "col1")
x
$a
ID col1 col2
1: a 3 7
2: a 6 8
3: a 9 9
$b
ID col1 col2
1: b 12 10
2: b 15 11
3: b 18 12
It does work, BUT I am not sure if:
This is the right or most efficient approach. What is the way to make mult look up at the arguments supplied to myfun?
Will this work if the functions are wrapped in a package?
1) Just pass column and constant to mult as additional arguments.
mult <- function(data, constant, column) {
lapply(data, function(i) {
i[ , (column) := get(column)*constant]
})
}
myfun <- function(data, constant, column) {
data <- split.dt(data = data)
data <- mult(data, constant, column)
}
2) Alternately define mult as:
mult <- function(data, envir = parent.frame()) with(envir,
lapply(data, function(i) {
i[ , (column) := get(column)*constant]
})
)
2a) or
mult <- function(data, envir = parent.frame()) {
constant <- envir$constant
column <- envir$column
lapply(data, function(i) {
i[ , (column) := get(column)*constant]
})
}
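The error in the question comes from lexical scoping: a function looks up its free variables where it was defined, not where it was called. The sketch below reproduces this in base R without data.table and shows the parent.frame() idea from option 2a; all function names here are illustrative.

```r
# Defined at top level, so its free variable `column` is looked up in the
# defining environment, not in the function that happens to call it
helper_lexical <- function() column

# Looks the variable up in the caller's frame instead
helper_caller <- function(envir = parent.frame()) get("column", envir = envir)

wrapper <- function(column) {
  list(
    lexical = tryCatch(helper_lexical(), error = function(e) conditionMessage(e)),
    caller  = helper_caller()
  )
}

res <- wrapper("col1")
print(res$lexical)  # an "object 'column' not found" message, assuming no global `column`
print(res$caller)   # "col1"
```

Passing the values as explicit arguments, as in option 1, avoids depending on the call stack at all, which is usually easier to reason about once the helpers are reused from several functions.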

NSE lazyeval::lazy vs. substitute when referring to variable names

I'm still trying to wrap my head around non-standard evaluation and how it's used in dplyr. I'm having trouble understanding why lazy evaluation is important when the function arguments are variable names, and so the original context's environment doesn't seem important.
In the code below, the function select3() uses lazy evaluation, but fails (I believe) because it tries to follow the variable name order all the way to base::order.
Is it okay to use substitute as I have in my select4(), or is there some other way I should implement this function? When would it actually be important to save the original environment, when I really want those arguments to refer to variables?
Thank you!
library(dplyr)
library(lazyeval)
# Same as dplyr::select
select2 <- function(.data, ...) {
select_(.data, .dots = lazy_dots(...))
}
# I want to have two capture groups of variables, so I need named arguments.
select3 <- function(.data, group1, group2) {
out1 <- select_(.data, .dots = lazy(group1))
out2 <- select_(.data, .dots = lazy(group2))
list(out1, out2)
}
df <- data.frame(x = 1:2, y = 3:4, order = 5:6)
# select3 seems okay at first...
df %>% select2(x, y)
df %>% select3(x, y)
# But fails when the variable is a function defined in the namespace
df %>% select2(x, order)
df %>% select3(x, order)
# Error in eval(expr, envir, enclos) : object 'datafile' not found
# Using substitute instead of lazy works. But I'm not sure I understand the
# implications of doing this.
select4 <- function(.data, group1, group2) {
out1 <- select_(.data, .dots = substitute(group1))
out2 <- select_(.data, .dots = substitute(group2))
list(out1, out2)
}
df %>% select4(x, order)
PS on a related note, is this a bug or intended behavior?
select(df, z)
# Error in eval(expr, envir, enclos) : object 'z' not found
# But if I define z as a numeric variable it works.
z <- 1
select(df, z)
Update
A. Webb points out below that the environment is important for select because the special functions like one_of can use objects from it.
Update 2
I used to have an ugly hack as a fix, but here's a much better way; I should've known that even lazy has a standard-evaluation counterpart, lazy_:
select6 <- function(.data, group1, group2) {
g1 <- lazy_(substitute(group1), env = parent.frame())
g2 <- lazy_(substitute(group2), env = parent.frame())
out1 <- select_(.data, .dots = g1)
out2 <- select_(.data, .dots = g2)
list(out1, out2)
}
# Or even more like the original...
lazy_parent <- function(expr) {
# Need to go up twice, because lazy_parent creates an environment for itself
e1 <- substitute(expr)
e2 <- do.call("substitute", list(e1), envir = parent.frame(1))
lazy_(e2, parent.frame(2))
}
select7 <- function(.data, group1, group2) {
out1 <- select_(.data, .dots = lazy_parent(group1))
out2 <- select_(.data, .dots = lazy_parent(group2))
list(out1, out2)
}
The problem here is that lazy by default follows promises, and order is a promise due to lazy loading of packages.
library(pryr)
is_promise(order)
#> TRUE
The default for lazy_dots, as used in select, is the opposite.
But there is something else going on here too, where the nature of the special ... is used to extract unevaluated expressions. While your use of substitute will work in many situations, attempts at renaming as available via select will fail.
select4(df,foo=x,bar=order)
#> Error in select4(df, foo = x, bar = order) :
#> unused arguments (foo = x, bar = order)
However, this works
select5 <- function(.data, ...) {
dots<-lazy_dots(...)
out1 <- select_(.data, .dots=dots[1])
out2 <- select_(.data, .dots=dots[2])
list(out1, out2)
}
select5(df,foo=x,bar=order)
#> [[1]]
#> foo
#> 1 1
#> 2 2
#>
#> [[2]]
#> bar
#> 1 5
#> 2 6
As another example, where substitute fails more directly, due to lack of carrying an environment, consider
vars<-c("x","y")
select4(df,one_of(vars),order)
#>Error in one_of(vars, ...) : object 'vars' not found
select5(df,one_of(vars),order)
#> [[1]]
#> x y
#> 1 1 3
#> 2 2 4
#>
#> [[2]]
#> order
#> 1 5
#> 2 6
The select4 version fails because it cannot find vars, where select5 succeeds due to lazy_dots carrying around the environment. Note select4(df,one_of(c("x","y")),order) is okay, as it uses literals.
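The role of the environment can be shown in base R alone. In this sketch (all names hypothetical), one helper keeps only the expression, as substitute does, and the other also records the calling environment, which is roughly the extra information a lazy object carries. It does not use lazyeval or dplyr.

```r
# Keeps only the expression
capture_expr <- function(arg) substitute(arg)

# Keeps the expression together with the environment it came from
capture_with_env <- function(arg) {
  list(expr = substitute(arg), env = parent.frame())
}

caller <- function() {
  cols_in_caller <- c("x", "y")  # local variable, invisible from outside
  list(bare = capture_expr(length(cols_in_caller)),
       full = capture_with_env(length(cols_in_caller)))
}
captured <- caller()

# With the environment, the local variable can still be found
with_env <- eval(captured$full$expr, captured$full$env)
print(with_env)  # 2

# Without it, evaluation elsewhere cannot find the variable
without_env <- tryCatch(eval(captured$bare, globalenv()),
                        error = function(e) conditionMessage(e))
print(without_env)
```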

Resources