What does builtins(internal = TRUE) return? - r

From ?builtins:
builtins(TRUE) returns an unsorted list of the names of internal functions, that is those which can be accessed as .Internal(foo(args ...)) for foo in the list.
I don't understand which functions are being returned.
I thought it would be all the closure functions in the base package that call .Internal().
However, the two sets don't match up.
base_objects <- mget(
ls(baseenv(), all.names = TRUE),
envir = baseenv()
)
internals <- names(
Filter(
assertive.types::is_internal_function,
base_objects
)
)
builtins_true <- builtins(internal = TRUE)
c(
both = length(intersect(internals, builtins_true)),
internals_not_builtins_true = length(setdiff(internals, builtins_true)),
builtins_true_not_internals = length(setdiff(builtins_true, internals))
)
## both internals_not_builtins_true builtins_true_not_internals
## 288 125 226
I also thought that it might be the values listed in src/main/names.c in R's source code, and there definitely seems to be some overlap with this, but it isn't exactly this list of values.
What is builtins() doing when you pass internal = TRUE?

Stibu's comment is a specific example of the general problem. ?builtins says that it fetches the names of the objects it returns directly from the symbol table (this is the C symbol table).
And builtins(TRUE) returns all the built-in objects callable via .Internal. That, however, doesn't mean there must be any function that calls .Internal(foo(args, ...)) for any foo.
Stibu gave one example: the internal function may not be called by an R function with the same name, as is the case for many generic functions where the default method calls .Internal.
Another example is something like .addCondHands and .addRestart, which are called by withCallingHandlers and withRestarts, respectively.
It's also possible that one R function calls multiple .Internal functions. I don't know of an example of that off the top of my head though.

After more digging, it seems that the list of functions is everything in the R_FunTab[] object in src/main/names.c where the second digit of the eval column is 1.
Here's a script to retrieve them.
library(stringi)
library(magrittr)
library(dplyr)
names.c <- readLines("https://raw.githubusercontent.com/wch/r-source/56a1b08b7282c5488acb71ee244098f4fd94f7c7/src/main/names.c")
fun_tab <- names.c[92:974] %>%
stri_replace_all_regex("^\\{", "") %>%
stri_replace_all_fixed("{PP", "PP") %>%
stri_replace_all_fixed("}},", "") %>%
stri_replace_all_fixed("\\t", "")
funs <- read.csv(text = fun_tab, header = FALSE, comment.char = "/")
cols <- names.c[86] %>%
stri_sub(4) %>%
stri_split_regex("\\t+") %>%
extract2(1) %>%
stri_trim()
colnames(funs) <- cols
funs$eval <- formatC(funs$eval, width = 3, flag = "0")
# Internal fns have 2nd digit of eval col == 1. See names.c[62:71]
internals <- funs %>% filter_(~ substring(eval, 2, 2) == 1)
I see slight differences when examining
setdiff(internals$printname, builtins(TRUE))
setdiff(builtins(TRUE), internals$printname)
For example builtins(TRUE) doesn't include shell.exec() if you aren't running Windows; mem.limits() was only recently removed from the devel branch of R, so it shows up in builtins(TRUE) for the current release version of R.

Related

Why is print() going to change the output of my function?

I am working on a function that tries to give me the top answers of a column. In the example below there is just a part of my whole function. My final goal is to run the function over a loop. I have detected something weird: why is print(df_col_indicator) gonna change the result when I define "df_col_indicator" externally and not within my function? With print(df_col_indicator) my function is actually exactly doing what I want..
library(dplyr)
library(tidyverse)
remove(list = ls())
dataframe_test <- data.frame(
county_name = c("a", "b","c", "d","e", "f", "g", "h"),
column_test1 = c(100,100,100,100,100,100,50,50),
column_test2 = c(40,90,50,40,40,100,13,14),
column_test3 = c(100,90,50,40,30,40,100,50),
month = c("2020-09-01", "2020-09-01" ,"2020-09-01" ,"2020-09-01" ,"2020-09-01" ,"2020-09-01" ,"2020-08-01","2020-08-01"))
choose_top_5 <- function(df, df_col_indicator, df_col_month, char_month, numb_top, df_col_county) {
### this here changes output of my function
#print(df_col_indicator) # changes output of my function depending on included or excluded
### enquo / ensym / deparse
df_col_indicator_ensym <- ensym(df_col_indicator)
df_col_month_ensym <- ensym(df_col_month)
### filter month and top 5 observations
df_top <- df %>%
filter(!!df_col_month_ensym == char_month) %>%
slice_max(!!df_col_indicator_ensym, n = numb_top) %>%
select(!!df_col_county, !!df_col_month_ensym, !!df_col_indicator_ensym)
return(df_top)
}
### define "df_col_indicator" within the function
a = choose_top_5(df = dataframe_test, df_col_indicator = "column_test3",
df_col_month = "month", char_month = "2020-09-01", numb_top = 5,
df_col_county = "county_name")
a
### define "df_col_indicator" externally
external = "column_test3"
b = choose_top_5(df = dataframe_test, df_col_indicator = external,
df_col_month = "month", char_month = "2020-09-01", numb_top = 5,
df_col_county = "county_name")
b
### goal is to run function over loop
external <- c("column_test1","column_test2","column_test3")
my_list <- list()
for (i in external) {
my_list[[i]] <- choose_top_5(df = dataframe_test, df_col_indicator = i,
df_col_month = "month", char_month = "2020-09-01", numb_top = 5,
df_col_county = "county_name")
}
my_list
Your example is quite lengthy. Let's boil it down to a minimal reproducible example with two very similar functions. These both take a single argument and simply print the passed variable to the console, and return the result of calling ensym on the same variable.
The only difference between the two is the order in which the calls to print and ensym are made.
library(rlang)
test_ensym1 <- function(x)
{
result <- ensym(x)
print(x)
return(result)
}
test_ensym2 <- function(x)
{
print(x)
result <- ensym(x)
return(result)
}
Now we might expect these two functions to do exactly the same thing, and indeed when we pass a string directly to them, they both give the same result:
test_ensym1("hello")
#> [1] "hello"
#> hello
test_ensym2("hello")
#> [1] "hello"
#> hello
But look what happens when we use an external variable to pass in our string:
y <- "hello"
test_ensym1(y)
#> [1] "hello"
#> y
test_ensym2(y)
#> [1] "hello"
#> hello
The functions both still print "hello", as expected, but they return a different result. When we called ensym first, the function returned the symbol y, and when we called print first it returned the symbol hello.
The reason for this is that when you call a function in R, the symbols you pass as parameters are not evaluated immediately. Instead, they are interpreted as promise objects and evaluated as required in the body of the function. It is this lazy evalutation that allows for some of the tidyverse trickery.
The difference between the two functions above is that calling print(x) forces the evaluation of x. Before that point, x is an unevaluated symbol. Afterwards, it behaves just like any other variable you would use interactively in the console, so when you call ensym, you are calling it on this evaluated variable, not as an unevaluated promise.
ensym, on the other hand, does not evaluate x, so if ensym is called first, it will return the unevaluated symbol that was passed to the function.
So actually, the easiest way to fix your problem is to move print to after the ensym call.
You also have to change ensym to as.symbol.
Consider a function like this
f <- function(x) ensym(x)
myvar <- "some string"
You will find that
> f("some string")
`some string`
> f(myvar)
myvar
This is because ensym only searches for the thing one step ahead. It attempts to convert whatever thing found into a symbol and just returns that (note that if what found is neither a string nor variable, then you will get an error). As such, in your first example, ensym returns column_test3; in your second one, it returns external.
As far as I can tell, what you want to do is getting the value that df_col_indicator represents and then converting that value into a symbol. This means you have to first evaluate df_col_indicator and then convert. as.symbol does what you need.
g <- function(x) as.symbol(x)
myvar <- "some string"
Some tests
> g("some string")
`some string`
> g(myvar)
`some string`

What is the best way to pass a list of functions as argument?

I am looking for advice on the best way to code passing a list of functions as argument.
What I want to do:
I would like to pass as an argument a list a functions to apply them to a specific input. And give output name based on those.
A "reproducible" example
input = 1:5
Functions to pass are mean, min
Expected call:
foo(input, something_i_ask_help_for)
Expected output:
list(mean = 3, min = 1)
If it's not perfectly clear, please see my two solutions to have an illustration.
Solution 1: Passing functions as arguments
foo <- function(input, funs){
# Initialize output
output = list()
# Compute output
for (fun_name in names(funs)){
# For each function I calculate it and store it in output
output[fun_name] = funs[[fun_name]](input)
}
return(output)
}
foo(1:5, list(mean=mean, min=min))
What I don't like with this method is that we can't call it by doing: foo(1:5, list(mean, min)).
Solution 2: passing functions names as argument and using get
foo2 <- function(input, funs){
# Initialize output
output = list()
# Compute output
for (fun in funs){
# For each function I calculate it and store it in output
output[fun] = get(fun)(input)
}
return(output)
}
foo2(1:5, c("mean", "min"))
What i don't like with this method is that we are not really passing the function-object as argument.
My question:
Both ways works, but I not quite sure which one to choose.
Could you help me by:
Telling me which one is the best?
What are the advantages and defaults of each methods?
Is there a another (better) method
If you need any more information, don't hesitate to ask.
Thanks!
Simplifying solutions in question
The first of the solutions in the question requires that the list be named and the second requires that the functions have names which are passed as character strings. Those two user interfaces could be implemented using the following simplifications. Note that we add an envir argument to foo2 to ensure function name lookup occurs as expected. Of those the first seems cleaner but if the functions were to be used interactively and less typing were desired then the second does do away with having to specify the names.
foo1 <- function(input, funs) Map(function(f) f(input), funs)
foo1(1:5, list(min = min, max = max)) # test
foo2 <- function(input, nms, envir = parent.frame()) {
Map(function(nm) do.call(nm, list(input), envir = envir), setNames(nms, nms))
}
foo2(1:5, list("min", "max")) # test
Alternately we could build foo2 on foo1:
foo2a <- function(input, funs, envir = parent.frame()) {
foo1(input, mget(unlist(funs), envir = envir, inherit = TRUE))
}
foo2a(1:5, list("min", "max")) # test
or base the user interface on passing a formula containing the function names since formulas already incorporate the notion of environment:
foo2b <- function(input, fo) foo2(input, all.vars(fo), envir = environment(fo))
foo2b(1:5, ~ min + max) # test
Optional names while passing function itself
However, the question indicates that it is preferred that
the functions themselves be passed
names are optional
To incorporate those features the following allows the list to have names or not or a mixture. If a list element does not have a name then the expression defining the function (usually its name) is used.
We can derive the names from the list's names or when a name is missing we can use the function name itself or if the function is anonymous and so given as its definition then the name can be the expression defining the function.
The key is to use match.call and pick it apart. We ensure that funs is a list in case it is specified as a character vector. match.fun will interpret functions and character strings naming functions and look them up in the parent frame so we use a for loop instead of Map or lapply in order that we not generate a new function scope.
foo3 <- function(input, funs) {
cl <- match.call()[[3]][-1]
nms <- names(cl)
if (is.null(nms)) nms <- as.character(cl)
else nms[nms == ""] <- as.character(cl)[nms == ""]
funs <- as.list(funs)
for(i in seq_along(funs)) funs[[i]] <- match.fun(funs[[i]])(input)
setNames(funs, nms)
}
foo3(1:5, list(mean = mean, min = "min", sd, function(x) x^2))
giving:
$mean
[1] 3
$min
[1] 1
$sd
[1] 1.581139
$`function(x) x^2`
[1] 1 4 9 16 25
One thing that you are missing is replacing the for loops with lapply. Also for functional programming it is often good practice to separate functions to do one thing. I personally like the version from solution 1 where you pass the functions in directly because it avoids another call in R and therefore is more efficient. In solution 2, it is best to use match.fun instead of get. match.fun is stricter than get in searching for functions.
x <- 1:5
foo <- function(input, funs) {
lapply(funs, function(fun) fun(input))
}
foo(x, c(mean=mean, min=min))
The above code simplifies your solution 1. To add to this function, you could add some error handling such as is.numeric for x and is.function for funs.

Not fully understanding how SE works across the dplyr verbs

I'm trying to understand how SE works in dplyr so I can use variables as inputs to these functions. I'm having some trouble with understanding how this works across the different functions and when I should be doing what. It would be really good to understand the logic behind this.
Here are some examples:
library(dplyr)
library(lazyeval)
a <- c("x", "y", "z")
b <- c(1,2,3)
c <- c(7,8,9)
df <- data.frame(a, b, c)
The following is exactly why i'd use SE and the *_ variant of a function. I want to change the name of what's being mutated based on another variable.
#Normal mutate - copies b into a column called new
mutate(df, new = b)
#Mutate using a variable column names. Use mutate_ and the unqouted variable name. Doesn't use the name "new", but use the string "col.new"
col.name <- "new"
mutate_(df, col.name = "b")
#Do I need to use interp? Doesn't work
expr <- interp(~(val = b), val = col.name)
mutate_(df, expr)
Now I want to filter in the same way. Not sure why my first attempt didn't work.
#Apply the same logic to filter_. the following doesn't return a result
val.to.filter <- "z"
filter_(df, "a" == val.to.filter)
#Do I need to use interp? Works. What's the difference compared to the above?
expr <- interp(~(a == val), val = val.to.filter)
filter_(df, expr)
Now I try to select_. Works as expected
#Apply the same logic to select_, an unqouted variable name works fine
col.to.select <- "b"
select_(df, col.to.select)
Now I move on to rename_. Knowing what worked for mutate and knowing that I had to use interp for filter, I try the following
#Now let's try to rename. Qouted constant, unqouted variable. Doesn't work
new.name <- "NEW"
rename_(df, "a" = new.name)
#Do I need an eval here? It worked for the filter so it's worth a try. Doesn't work 'Error: All arguments to rename must be named.'
expr <- interp(~(a == val), val = new.name)
rename_(df, expr)
Any tips on best practice when it comes to using variable names across the dplyr functions and when interp is required would be great.
The differences here are not related to which dplyr verb you are using. They are related to where you are trying to use the variable. You are mixing whether the variable is used as a function argument or not, and whether it should be interpreted as a name or as a character string.
Scenario 1:
You want to use your variable as an argument name. Such as in your mutate example.
mutate(df, new = b)
Here new is the name of a function argument, it is left of a =. The only way to do this is to use the .dots argument. Like
col.name <- 'new'
mutate_(df, .dots = setNames(list(~b), col.name))
Running just setNames(list(~b), col.name) shows you how we have an expression (~b), which is going right of the =, and the name is going left of the =.
Scenario 2:
You want to give only a variable as a function argument. This is the simplest case. Let's again use mutate(df, new = b), but in this case we want b to be variable. We could use:
v <- 'b'
mutate_(df, .dots = setNames(list(v), 'new'))
Or simply:
mutate_(df, new = b)
Scenario 3
You want to do some combinations of variable and fixed things. That is, your expression should only be partly variable. For this we use interp. For example, what if we would like to do something like:
mutate(df, new = b + 1)
But being able to change b?
v <- 'b'
mutate_(df, new = interp(~var + 1, var = as.name(v)))
Note that we as.name to make sure that we insert b into the expression, not 'b'.

Selecting Which Argument to Pass Dynamically in R

I'm trying to pass a specific argument dynamically to a function, where the function has default values for most or all arguments.
Here's a toy example:
library(data.table)
mydat <- data.table(evildeeds=rep(c("All","Lots","Some","None"),4),
capitalsins=rep(c("All", "Kinda","Not_really", "Virginal"),
each = 4),
hellprobability=seq(1, 0, length.out = 16))
hellraiser <- function(arg1 = "All", arg2= "All "){
mydat[(evildeeds %in% arg1) & (capitalsins %in% arg2), hellprobability]}
hellraiser()
hellraiser(arg1 = "Some")
whicharg = "arg1"
whichval = "Some"
#Could not get this to work:
hellraiser(eval(paste0(whicharg, '=', whichval)))
I would love a way to specify dynamically which argument I'm calling: In other words, get the same result as hellraiser(arg1="Some") but while picking whether to send arg1 OR arg2 dynamically. The goal is to be able to call the function with only one parameter specified, and specify it dynamically.
You could use some form of do.call like
do.call("hellraiser", setNames(list(whichval), whicharg))
but really this just seems like a bad way to handle arguments for your functions. It might be better to treat your parameters like a list that you can more easily manipulate. Here's a version that allows you to choose values where the argument names are treated like column names
hellraiser2 <- function(..., .dots=list()) {
dots <- c(.dots, list(...))
expr <- lapply(names(dots), function(x) bquote(.(as.name(x)) %in% .(dots[[x]])))
expr <- Reduce(function(a,b) bquote(.(a) & .(b)), expr)
eval(bquote(mydat[.(expr), hellprobability]))
}
hellraiser2(evildeeds="Some", capitalsins=c("Kinda","Not_really"))
hellraiser2(.dots=list(evildeeds="Some", capitalsins=c("Kinda","Not_really")))
This use of ... and .dots= syntax is borrowed from the dplyr standard evaluation functions.
I managed to get the result with
hellraiser(eval(parse(text=paste(whicharg, ' = \"', whichval, '\"', sep=''))))

How to use an unknown number of key columns in a data.table

I want to do the same as explained here, i.e. adding missing rows to a data.table. The only additional difficulty I'm facing is that I want the number of key columns, i.e. those rows that are used for the self-join, to be flexible.
Here is a small example that basically repeats what is done in the link mentioned above:
df <- data.frame(fundID = rep(letters[1:4], each=6),
cfType = rep(c("D", "D", "T", "T", "R", "R"), times=4),
variable = rep(c(1,3), times=12),
value = 1:24)
DT <- as.data.table(df)
idCols <- c("fundID", "cfType")
setkeyv(DT, c(idCols, "variable"))
DT[CJ(unique(df$fundID), unique(df$cfType), seq(from=min(variable), to=max(variable))), nomatch=NA]
What bothers me is the last line. I want idCols to be flexible (for instance if I use it within a function), so I don't want to type unique(df$fundID), unique(df$cfType) manually. However, I just don't find any workaround for this. All my attempts to automatically split the subset of df into vectors, as needed by CJ, fail with the error message Error in setkeyv(x, cols, verbose = verbose) : Column 'V1' is type 'list' which is not (currently) allowed as a key column type.
CJ(sapply(df[, idCols], unique))
CJ(unique(df[, idCols]))
CJ(as.vector(unique(df[, idCols])))
CJ(unique(DT[, idCols, with=FALSE]))
I also tried building the expression myself:
str <- ""
for (i in idCols) {
str <- paste0(str, "unique(df$", i, "), ")
}
str <- paste0(str, "seq(from=min(variable), to=max(variable))")
str
[1] "unique(df$fundID), unique(df$cfType), seq(from=min(variable), to=max(variable))"
But then I don't know how to use str. This all fails:
CJ(eval(str))
CJ(substitute(str))
CJ(call(str))
Does anyone know a good workaround?
Michael's answer is great. do.call is indeed needed to call CJ flexibly in that way, afaik.
To clear up on the expression building approach and starting with your code, but removing the df$ parts (not needed and not done in the linked answer, since i is evaluated within the scope of DT) :
str <- ""
for (i in idCols) {
str <- paste0(str, "unique(", i, "), ")
}
str <- paste0(str, "seq(from=min(variable), to=max(variable))")
str
[1] "unique(fundID), unique(cfType), seq(from=min(variable), to=max(variable))"
then it's :
expr <- parse(text=paste0("CJ(",str,")"))
DT[eval(expr),nomatch=NA]
or alternatively build and eval the whole query dynamically :
eval(parse(text=paste0("DT[CJ(",str,"),nomatch=NA")))
And if this is done a lot then it may be worth creating yourself a helper function :
E = function(...) eval(parse(text=paste0(...)))
to reduce it to :
E("DT[CJ(",str,"),nomatch=NA")
I've never used the data.table package, so forgive me if I miss the mark here, but I think I've got it. There's a lot going on here. Start by reading up on do.call, which allows you to evaluate any function in a sort of non-traditional manner where arguments are specified by a supplied list (where each element is in the list is positionally matched to the function arguments unless explicitly named). Also notice that I had to specify min(df$variable) instead of just min(variable). Read Hadley's page on scoping to get an idea of the issue here.
CJargs <- lapply(df[, idCols], unique)
names(CJargs) <- NULL
CJargs[[length(CJargs) +1]] <- seq(from=min(df$variable), to=max(df$variable))
DT[do.call("CJ", CJargs),nomatch=NA]

Resources