R pipe, mget, and environments - r

I'm posting this in hopes someone could explain the behavior here. And perhaps this may save others some time in tracking down how to fix a similar error.
The answer is likely somewhere here in this vignette by Hadley Wickham and Lionel Henry. Yet it will take someone like me weeks of study to connect the dots.
I am running a number of queries from a remote database and then combining them into a single data.table. I add the "part_" prefix to the name of each individual query result and use ls() and mget() with data.table's rbindlist() to combine them.
This works:
results_all <- rbindlist(mget(ls(pattern = "part_", )))
I learned that approach, probably from "list data.tables in memory and combine by row (rbind)", and it is a helpful thing to know how to do for sure.
For readability, I often prefer using the magrittr pipe (or chaining with data.table) and especially so with projects like this because I use dplyr to query the database. Yet this code results in an error:
results_all <- ls(pattern = "part_", ) %>%
mget() %>%
rbindlist()
The error reads Error: value for ‘part_a’ not found where part_a is the first object name in the character vector returned by ls().
Searching for that error message, I came across the discussion in this data.table GitHub issue. Reading through that, I tried setting "inherits = TRUE" within mget() like so:
results_all <- ls(pattern = "part_", ) %>%
mget(inherits = TRUE) %>%
rbindlist()
And that works. So the error is happening when piping the result of ls() to mget(). And given that nesting ls() within mget() works, my guess is that it is something to do with the pipe and "the enclosing frames of the environment".
In writing this up, I came across "Unexpected error message while joining data.table with rbindlist() using mget()". From the discussion there I found out that this also works:
results_all <- ls(pattern = "part_", ) %>%
mget(envir = .GlobalEnv) %>%
rbindlist()
Again, I am hoping someone can explain what is going on for folks looking to learn more about how environments work in R.
Edit: Adding reproducible example
Per the request for a reproducible example, running the code above using these three data.tables (data.frames or tibbles will behave the same) should do it.
part_a <- data.table(col1 = 1:10, col2 = sample(letters, 10))
part_b <- data.table(col1 = 11:20, col2 = sample(letters, 10))
part_c <- data.table(col1 = 21:30, col2 = sample(letters, 10))

The rhs argument to a pipe operator (in your example, the expression mget()) is never evaluated as a function call by the interpreter. The pipe operator is an infix function that performs non-standard evaluation of its second argument (rhs). The pipe function composes and performs a new function call using the RHS expression as a sort of "template".
The calling environment of this new function call is the function environment of %>%, not the calling environment of the lhs function or the global environment. .GlobalEnv and the calling environment of the lhs function happen to be the same environment in your example, and that environment is a parent to the function environment of %>%, which is why inherits = TRUE or setting the environment to .GlobalEnv works for you.
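The lookup rules described above can be reproduced in base R without any pipe. This is a minimal sketch under stated assumptions: demo_env stands in for the workspace and child_env stands in for the pipe's own function environment; both names are hypothetical and nothing here is magrittr's actual implementation.

```r
# demo_env plays the role of the workspace holding the part_* objects
demo_env <- new.env()
assign("part_a", 1:3, envir = demo_env)
assign("part_b", 4:6, envir = demo_env)

# child_env plays the role of a function environment whose parent is demo_env
child_env <- new.env(parent = demo_env)

nms <- ls(envir = demo_env, pattern = "^part_")

# Looking only in child_env fails, because the objects live in its parent
strict <- tryCatch(
  mget(nms, envir = child_env),
  error = function(e) conditionMessage(e)
)

# Walking up the enclosing environments finds them
inherited <- mget(nms, envir = child_env, inherits = TRUE)

# So does naming the environment that actually holds them
explicit <- mget(nms, envir = demo_env)
```

Here strict ends up holding the error message rather than a list, while inherited and explicit contain the same objects, which mirrors why inherits = TRUE and envir = .GlobalEnv both work in the question.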

Related

interpolate_from_env: is call to setNames redundant?

I've been exploring engine.R to improve my understanding of how the sql engine for Knitr works. I noticed a call to setNames in the definition for interpolate_from_env that looks like it may be redundant (at least to my relatively inexperienced eyes!)
I am unsure if it is appropriate to reproduce the entirety of the function definition below, so I'm just including the lines in question:
args = if (length(names) > 0) setNames(
mget(names, envir = env), names)
setNames appears to be called on the result of mget, which already returns a named list of objects. Similarly, the call to identical below returns TRUE:
identical(
mget(names, envir = env),
setNames(mget(names, envir = env), names)
)
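A self-contained way to run that comparison, assuming a throwaway environment and variable names (env, x and y here are illustrative, not taken from engine.R):

```r
# Hypothetical environment with two objects to look up
env <- new.env()
assign("x", 1L, envir = env)
assign("y", "a", envir = env)

names <- c("x", "y")

# mget() on its own, and the same result passed through setNames()
plain <- mget(names, envir = env)
renamed <- setNames(mget(names, envir = env), names)

same <- identical(plain, renamed)
```

This only shows that the two agree for a plain, unnamed character vector of names; it does not settle whether the call is redundant in every case.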
Have I overlooked something here or is the call to setNames in fact redundant?
TIA!

How do you call a function that takes no inputs within a pipeline?

I tried searching for this but couldn't find any similar questions. Let's say, for the sake of a simple example, I want to do the following using dplyr's pipe %>%.
c(1,3,5) %>% ls() %>% mean()
Setting aside what the use case would be for a pipeline like this, how can I call a function "mid-pipeline" that doesn't need any inputs coming from the left-hand side and just pass them along to the next function in the pipeline? Basically, I want to put an "intermission" or "interruption" of sorts into my pipeline, let that function do its thing, and then continue on my merry way. Obviously, the above doesn't actually work, and I know the T pipe %T>% also won't be of use here because it still expects the middle function to need inputs coming from the lhs. Are there options here shy of assigning intermediate objects and restarting the pipeline?
With the ‘magrittr’ pipe operator you can put an operand inside {…} to prevent automatic argument substitution:
c(1,3,5) %>% {ls()} %>% mean()
# NA
# Warning message:
# In mean.default(.) : argument is not numeric or logical: returning NA
… but of course this serves no useful purpose.
Incidentally, ls() inside a pipeline is executed in its own environment rather than the calling environment so its use here is even less useful. But a different function that returned a sensible value could be used, e.g.:
c(1,3,5) %>% {rnorm(10)} %>% mean()
# [1] -0.01068046
Or, if you intended for the left-hand side to be passed on, skipping the intermediate ls(), you could do the following:
c(1,3,5) %>% {ls(); .} %>% mean()
# [1] 3
… again, using ls() here won’t be meaningful but some other function that has a side-effect would work.
You could define an auxiliary function like this, which takes an argument it doesn't use so that it fits in the pipe:
ls_return_x <- function(x){
print(ls())
x
}
c(1,3,5) %>% ls_return_x() %>% mean()
Note that the ls() call in this example prints the objects in the environment of the ls_return_x() function, not the global environment. See the help page for ls() if you want to list the objects in the global environment instead.
I don't know if there is an inbuilt function but you could certainly create a helper function for this
> callfun <- function(x, fun){fun(); return(x)}
> c(1, 3, 5) %>% callfun(fun = ls) %>% mean()
# [1] 3
I don't really see the point but hey - it's your life.
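To make the helper's behaviour concrete, here is a small base R sketch that calls it without any pipe. The events vector and the note() function are hypothetical stand-ins for whatever side effect the "intermission" is meant to perform.

```r
# Helper from above: run fun() for its side effect, then pass x through
callfun <- function(x, fun) {
  fun()
  x
}

# Hypothetical side effect: record that the checkpoint was reached
events <- character()
note <- function() events <<- c(events, "checkpoint reached")

# The value on the left is returned untouched, so mean() sees c(1, 3, 5)
result <- mean(callfun(c(1, 3, 5), fun = note))
```

After this runs, result is 3 and events holds one entry, so the side effect happened and the data was handed on unchanged.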

dplyr, rlang: Unable to predict if minor variants of passing names to nested dplyr functions will work

Data for reproducibility
.i <- tibble(a=2*1:4+1, b=2*1:4)
This function is supposed to take its data and other arguments as unquoted names, find those names in the data, and use them to add a column and filter out the top row. It does not work. Mutate says it cannot find a.
t1 <- function(.j=.i, X=a, Y=b){
e_X <- enquo(X)
e_Y <- enquo(Y)
mutate(.data=.j, pass=UQ(e_X)+1) %>%
filter(UQ(e_Y) > 3) -> out
out
}
t1(a,b)
This function, which I found by typo -- note the .i instead of .j in the mutate statement -- does what the previous function was supposed to do. And I don't know why. I think it is skipping over the function arguments and finding .i in the global environment. Or maybe it is using a Ouija board.
t2 <- function(.j=.i, X=a, Y=b){
e_X <- enquo(X)
e_Y <- enquo(Y)
mutate(.data=.i, pass=UQ(e_X)+1) %>%
filter(UQ(e_Y) > 3) -> out
out
}
t2(a,b)
Since mutate could not find .j when passed to it in the usual R way, maybe it needs to be passed in an rlang-style quosure, like the formals X and Y. This function also does not work, with UQ in mutate saying that it cannot find a. Like the first function above, it works if the .j in mutate is replaced with a .i. (Seems like there should be an "enquos" to parallel quos).
t3 <- function(.j=.i, X=a, Y=b){
e_j <- enquo(.j)
e_X <- enquo(X)
e_Y <- enquo(Y)
mutate(.data=UQ(.j), pass=UQ(e_X)+1) %>%
filter(UQ(e_Y) > 3) -> out
out
}
t3(a,b)
Finally, it appears that, once the .i substitution in mutate is made, t4() no longer needs a data argument at all. See below, where I replace it with bop_foo_foo. If, however, you replace bop_foo_foo throughout with the name of the data, .i, (t5()) then UQ again fails to find a.
bop_foo_foo <- 0
t4 <- function(bop_foo_foo, X=a, Y=b){
e_j <- enquo(bop_foo_foo)
e_X <- enquo(X)
e_Y <- enquo(Y)
mutate(.data=UQ(.i), pass=UQ(e_X)+1) %>%
filter(UQ(e_Y) > 3) -> out
out
}
t4(a,b)
The functions above seem to me to be relatively minor variants on a single function. I have run dozens more, and although I have observed some patterns, and read the enquo and UQ help files I do not know how many times, a real understanding continues to elude me.
I would like to know why the functions above that don't work fail, and why the ones that do work succeed. I don't necessarily need a function-by-function critique. If you can state general principles that embody the required understanding, that would be delightful. And more than sufficient.
I think it is skipping over the function arguments and finding .i in the global environment.
Yes, scope of symbols in R is hierarchical. The variables local to a function are looked up first, and then the surrounding environment of the function is inspected, and so on.
mutate(.data = UQ(.j), ...)
I think you are missing the difference between regular arguments and (quasi)quoted arguments. Unquoting is only relevant for quasiquoted arguments. Since the .data argument of mutate() is not quasiquoted it does not make sense to try and unquote stuff. The quasiquoted arguments are the ones that are captured/quoted with enexpr() or enquo(). You can tell whether an argument is quasiquoted either by looking at the documentation or by recognising that the argument supports direct references to columns (regular arguments need to be explicit about where to find the columns).
In the next version of rlang, the exported UQ() function will throw an error to make it clear that it should not be called directly and that it can only be used in quasiquoted arguments.
I would suggest:
Call the first argument of your function data or df rather than .i.
Don't give it a default. The user should always supply the data.
Don't capture it with enquo() or enexpr() or substitute(). Instead pass it directly to the data argument of other verbs.
Once this is out of the way it will be easier to work out the rest.
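A minimal sketch of a function that follows these suggestions, assuming dplyr is installed. The names add_pass and scores are hypothetical, and !! is used as the operator form of unquoting in place of calling UQ() directly.

```r
library(dplyr)

# data is a regular argument with no default; X and Y are captured as quosures
add_pass <- function(data, X, Y) {
  e_X <- enquo(X)
  e_Y <- enquo(Y)
  data %>%
    mutate(pass = (!!e_X) + 1) %>%  # unquote only inside quasiquoted arguments
    filter((!!e_Y) > 3)
}

# Same values as the .i tibble in the question
scores <- data.frame(a = 2 * 1:4 + 1, b = 2 * 1:4)

out <- add_pass(scores, a, b)
```

The data frame is passed straight through to mutate(), while only the column names go through enquo() and are unquoted, which is the split between regular and quasiquoted arguments described above.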

Why assigning to globalenv not allowed when creating R package?

I'm creating an R package and I have a function that returns an object whose name is constructed from the arguments passed.
I use the function assign() to do this as in the code below and it works fine.
df <- data.frame(A = 1:10, B = 2:11, C = 5:14)
ot_test <- function(df, min){
tmp <- colSums(df)
tmp2 <- df[, tmp >= min]
assign(paste0(deparse(substitute(df)), "_min_", min), tmp2, envir= .GlobalEnv)
}
ot_test(df,60)
ls()
[1] "df" "df_min_60" "ot_test"
But when I check the package with devtools::check I get this message:
Found the following assignments to the global environment:
File 'test/R/ottest.R':
assign(paste0(deparse(substitute(df)), "_min_", min), tmp2, envir = .GlobalEnv)
Is there a way to do the same without having .GlobalEnv in the argument or without using the function assign()?
It's just an ugly, bad thing to do in a functional programming environment.
What's wrong with:
df_min_60 = ot_test(df,60)
Your argument will be that your method saves a bit of typing, but it opens you up to all sorts of bugs and obscurities.
Suppose I want to call ot_test in a function, in a loop maybe. Now it's stomped on the df_min_60 in my global workspace, with no warning or obvious clue that it's going to do it. Gee, thanks for that. So what do I have to do?
ot_test(df, 60)
# now rename so I don't stomp on it
df_min_60.1 = df_min_60
results = domyloop(d1,d2,d3)
Which has meant more typing.
Now another idea. Suppose I want to call ot_test on a list of data frames and make a list of the results. Normally I'd do something like:
for(i in 1:10){res[[i]] = ot_test(data[[i]], 60)}
but with your code I can't. I have to do:
for(i in 1:10){d=data[[i]]; ot_test(d,60); res[[i]] = d_min_60}
which is WAY more typing.
Be thankful that devtools::check only gives a message and doesn't set your computer on fire for doing this. Seriously, don't create things in the global environment, return them as return values.
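As a sketch of the return-value approach, here is one way ot_test could be rewritten so that the caller chooses the name. The drop = FALSE argument and the list names first and second are my own additions for illustration.

```r
# Return the filtered data frame instead of assigning it into .GlobalEnv
ot_test <- function(df, min) {
  keep <- colSums(df) >= min
  df[, keep, drop = FALSE]  # drop = FALSE keeps a data frame even for one column
}

df <- data.frame(A = 1:10, B = 2:11, C = 5:14)

# The caller decides what the result is called
df_min_60 <- ot_test(df, 60)

# Applying it to several data frames needs no renaming tricks
inputs <- list(first = df, second = df * 2)
results <- lapply(inputs, ot_test, min = 60)
```

With the example data, column A sums to 55 and is dropped from df_min_60, while every column of df * 2 clears the threshold.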

cannot combine with and functions in R

Could somebody please point out to me why the following example does not work:
df <- data.frame(ex =rep(1,5))
sample.fn <- function(var1) with(df, mean(var1))
sample.fn(ex)
It seems that I am using the wrong syntax to combine with inside of a function.
Thanks,
This is what I meant by learning to use "[" (actually"[["):
> df <- data.frame(ex =rep(1,5))
> sample.fn <- function(var1) mean(df[[var1]])
> sample.fn('ex')
[1] 1
You cannot use an unquoted ex since there is no object named 'ex', at least not at the global environment where you are making the call to sample.fn. 'ex' is a name only inside the environment of the df-dataframe and only df itself is "visible" when the sample.fn-function is called.
Out of interest, I tried using the method that the with.default function uses to build a function taking an unquoted expression argument in the manner you were expecting:
samp.fn <- function(expr) mean(
eval(substitute(expr), df, enclos = parent.frame())
)
samp.fn(ex)
#[1] 1
It's not a very useful function, since it would only be applicable when there was a dataframe named 'df' in the parent.frame(). And apologies for incorrectly claiming that there was a warning on the help page. As @rawr points out, the warning about using functions that depend on non-standard evaluation appears on the subset page.
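One way around the dependence on a data frame named 'df' is to pass the data frame in as an argument. This is a sketch only; mean_of is a hypothetical name, and it uses the same eval(substitute()) pattern as samp.fn above.

```r
# Evaluate an unquoted expression inside a data frame supplied by the caller
mean_of <- function(data, expr) {
  mean(eval(substitute(expr), data, enclos = parent.frame()))
}

df <- data.frame(ex = rep(1, 5))

m1 <- mean_of(df, ex)      # column looked up inside df
m2 <- mean_of(df, ex * 2)  # any expression over the columns works
```

The same caution about non-standard evaluation applies here: this is convenient interactively, but the df[[var1]] form is easier to call from other functions.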
