Recode and Mutate_all in dplyr

Recode and Mutate_all in dplyr - r

I am trying to use recode and mutate_all to recode columns. However, for some reason, I am getting an error. I do believe this post is similar to how to recode (and reverse code) variables in columns with dplyr but the answer in that post has used lapply function.
Here's what I tried after reading dplyr package's help pdf.
by_species<-matrix(c(1,2,3,4),2,2)
tbl_species<-as_data_frame(by_species)
tbl_species %>% mutate_all(funs(. * 0.4))
# A tibble: 2 x 2
V1 V2
<dbl> <dbl>
1 0.4 1.2
2 0.8 1.6
So, this works well.
However, this doesn't work:
grades<-matrix(c("A","A-","B","C","D","B-","C","C","F"),3,3)
tbl_grades <- as_data_frame(grades)
tbl_grades %>% mutate_all(funs(dplyr::recode(.,A = '4.0')))
I get this error:
Error in vapply(dots[missing_names], function(x) make_name(x$expr), character(1)) :
values must be length 1,
but FUN(X[[1]]) result is length 3
Can someone please explain what's the problem and why above code isn't working?
I'd appreciate any help.
Thanks

#Mir has done a good job describing the problem. Here's one possible workaround. Since the problem is in generating the name, you can supply your own name
tbl_grades %>% mutate_all(funs(recode=recode(.,A = '4.0')))
Now this does add columns rather than replace them. Here's a function that will "forget" that you supplied those names
dropnames<-function(x) {if(is(x,"lazy_dots")) {attr(x,"has_names")<-FALSE}; x}
tbl_grades %>% mutate_all(dropnames(funs(recode=dplyr::recode(.,A = '4.0'))))
This should behave like the original. Although really
tbl_grades %>% mutate_all(dropnames(funs(recode(.,A = '4.0'))))
because dplyr often has special c++ versions of some functions that it can use if it recognized the functions (like lag for example) but this will not happen if you also specify the namespace (if you use dplyr::lag).

If we call it without the dplyr:: then it works fine.
funs(recode(., A = '4.0'))
<fun_calls>
$ recode: recode(., A = "4.0")
tbl_grades %>% mutate_all(funs(recode(. ,A = '4.0')))
# A tibble: 3 x 3
V1 V2 V3
<chr> <chr> <chr>
1 4.0 C C
2 A- D C
3 B B- F
The issue lies in the funs call. If we extract that part out the same error appears.
funs(dplyr::recode(., A = '4.0'))
Error in vapply(dots[missing_names], function(x) make_name(x$expr), character(1)) :
values must be length 1, but FUN(X[[1]]) result is length 3
The issue boils down to the fact that :: is a function itself. (see ?`::`). To visualize this a little better, we look at both the infix and prefix ways of writing the function.
`::`(dplyr, recode)
function (.x, ..., .default = NULL, .missing = NULL)
{
UseMethod("recode")
}
<environment: namespace:dplyr>
dplyr::recode
function (.x, ..., .default = NULL, .missing = NULL)
{
UseMethod("recode")
}
<environment: namespace:dplyr>
funs attempts to extract the function names of its arguments by grabbing the first element of the call object and calling as.character on it. The first element of the call object is the calling function and subsequent elements are the argument values. For example:
as.call(quote(recall(., A = '4.0')))
recall(., A = "4.0")
as.call(quote(recall(., A = '4.0')))[[1]]
recall
as.call(quote(recall(., A = '4.0')))[[2]]
.
as.call(quote(recall(., A = '4.0')))[[3]]
"4.0"
as.call(quote(recall(., A = '4.0')))[[4]]
Error in as.call(quote(recall(., A = "4.0")))[[4]] :
subscript out of bounds
This runs into issues when dplyr::recode is used because this creates a nested call object. When we grab the first element, we get not just a name of a function, but an entire function call.
as.call(quote(dplyr::recall(., A = '4.0')))
dplyr::recall(., A = "4.0")
as.call(quote(dplyr::recall(., A = '4.0')))[[1]]
dplyr::recall
as.call(quote(dplyr::recall(., A = '4.0')))[[1]][[1]]
`::`
as.call(quote(dplyr::recall(., A = '4.0')))[[1]][[2]]
dplyr
as.call(quote(dplyr::recall(., A = '4.0')))[[1]][[3]]
recall
In contrast to when recode is called without dplyr::.
as.call(quote(recall(., A = '4.0')))[[1]][[1]]
Error in as.call(quote(recall(., A = "4.0")))[[1]][[1]] :
object of type 'symbol' is not subsettable
Because the first element when dplyr:: is included is a whole function call, as.character results in a vector that has both the name of a function and its arguments.
as.call(quote(dplyr::recall(., A = '4.0')))[[1]] %>% as.character()
[1] "::" "dplyr" "recall"
Funs reasonably expects the name of the function to have only one element, not three, and thus errors out.

Related

In R, what does parentheses followed by parentheses mean

The syntax for using scales::label_percent() in a mutate function is unusual because it uses double parentheses:
label_percent()(an_equation_goes_here)
I don't think I have seen ()() syntax in R before and I don't know how to look it up because I don't know what it is called. I tried ?`()()` and ??`()()` and neither helped. What is double parentheses syntax called? Can someone recommend a place to read about it?
Here is an example for context:
library(tidyverse)
members <-
read_csv(
paste0(
"https://raw.githubusercontent.com/rfordatascience/tidytuesday/",
"master/data/2020/2020-09-22/members.csv"
),
show_col_types = FALSE)
members %>%
count(success, died) %>%
group_by(success) %>%
# old syntax:
# mutate(percent = scales::percent(n / sum(n)))
# new syntax:
mutate(percent = scales::label_percent()(n / sum(n)))
#> # A tibble: 4 × 4
#> # Groups: success [2]
#> success died n percent
#> <lgl> <lgl> <int> <chr>
#> 1 FALSE FALSE 46452 98%
#> 2 FALSE TRUE 868 2%
#> 3 TRUE FALSE 28961 99%
#> 4 TRUE TRUE 238 1%
Created on 2023-01-01 with reprex v2.0.2

Most functions return a value, whether something atomic (numeric, integer, character), list-like (including data.frame), or something more complex. For those, the single set of ()s (as you recognize) are for the one call.
Occasionally, however, a function call returns a function. For example, if we look at ?scales::label_percent, we can scroll down to
Value:
All 'label_()' functions return a "labelling" function, i.e. a
function that takes a vector 'x' and returns a character vector of
'length(x)' giving a label for each input value.
Let's look at it step-by-step:
fun <- scales::label_percent()
fun
# function (x)
# {
# number(x, accuracy = accuracy, scale = scale, prefix = prefix,
# suffix = suffix, big.mark = big.mark, decimal.mark = decimal.mark,
# style_positive = style_positive, style_negative = style_negative,
# scale_cut = scale_cut, trim = trim, ...)
# }
# <bytecode: 0x00000168ee5440e8>
# <environment: 0x00000168ee5501b8>
fun(0.35)
# [1] "35%"
The first call to scales::label_percent() returned a function. We can then use that function with as many arguments as we want.
If you don't want to store the returned function in a variable like fun, you can use it immediately by following the first set of ()s with another set of parens.
scales::label_percent()(0.35)
# [1] "35%"
A related question is "why would you want a function to return another function?" There are many stylistic reasons, but in the case of scales::label_*, they are designed to be used in places where the option needs to be expressed as a function, not as a static value. For example, it can be used in ggplot code: axis ticks are often placed conveniently with simple heuristics to determine the count, locations, and rendering of the ticks marks. While one can use ggplot2::scale_*_manual(values = ...) to manually control how many, where, and what they look like, it is often more convenient to not care a priori how many or where, and in cases where faceting is used, it can vary per faceting variable(s), so not something one can easily assign in a static variable. In those cases, it is often better to assign a function that is given some simple parameters (such as the min/max of the axis), and the function returns something meaningful.
Why can't we just pass it scales::label_percent? (Good question.) Even though you're using the default values in your call here, one might want to change any or all of the controllable things, such as:
suffix= defaults to "%", but perhaps you want a space as in " %"?
decimal.mark= defaults to ".", but maybe your locale prefers commas?
While it is feasible to have multiple functions for all of the combinations of these options, it is generally easier in the long run to provide a "template function" for creating the function, such as
fun <- scales::label_percent(accuracy = 0.01, suffix = " %", decimal.mark = ",")
fun(0.353)
# [1] "35,30 %"
scales::label_percent(accuracy = 0.01, suffix = " %", decimal.mark = ",")(0.353)
# [1] "35,30 %"

An Expression followed by an argument list in round parentheses (( / )) is called a Function Call in R.
There's no need to have a special name for two function calls in a row. They're still just function calls.

If we run a function and the value returned by the function is itself a function then we could call one that too.
For example, we first run f using f() assigning the return value to g but the return value is itself a function so g is a function -- it is the function function() 3 -- and we can run that too.
# f is a function which returns a function
f <- function() function() 3
g <- f() # this runs f which returns `function() 3`
g() # thus g is a function so we can call it
## [1] 3
Now putting that all together we can write it in one line as
f()()
## [1] 3
As seen there is only one meaning for () and the fact that there were two together was simply because we were calling the result of a call.

Separating error message from error condition in package

Background
Packages can include a lot of functions. Some of them require informative error messages, and perhaps some comments in the function to explain what/why is happening. An example, f1 in a hypothetical f1.R file. All documentation and comments (both why the error and why the condition) in one place.
f1 <- function(x){
if(!is.character(x)) stop("Only characters suported")
# user input ...
# .... NaN problem in g()
# ....
# ratio of magnitude negative integer i base ^ i is positive
if(x < .Machine$longdouble.min.exp / .Machine$longdouble.min.exp) stop("oof, an error")
log(x)
}
f1(-1)
# >Error in f1(-1) : oof, an error
I create a separate conds.R, specifying a function (and w warning, s suggestion) etc, for example.
e <- function(x){
switch(
as.character(x),
"1" = "Only character supported",
# user input ...
# .... NaN problem in g()
# ....
"2" = "oof, and error") |>
stop()
}
Then in, say, f.R script I can define f2 as
f2 <- function(x){
if(!is.character(x)) e(1)
# ratio of magnitude negative integer i base ^ i is positive
if(x < .Machine$longdouble.min.exp / .Machine$longdouble.min.exp) e(2)
log(x)
}
f2(-1)
#> Error in e(2) : oof, and error
Which does throw the error, and on top of it a nice traceback & rerun with debug option in the console. Further, as package maintainer I would prefer this as it avoids considering writing terse if statements + 1-line error message or aligning comments in a tryCatch statement.
Question
Is there a reason (not opinion on syntax) to avoid writing a conds.R in a package?

There is no reason to avoid writing conds.R. This is very common and good practice in package development, especially as many of the checks you want to do will be applicable across many functions (like asserting the input is character, as you've done above. Here's a nice example from dplyr.
library(dplyr)
df <- data.frame(x = 1:3, x = c("a", "b", "c"), y = 4:6)
names(df) <- c("x", "x", "y")
df
#> x x y
#> 1 1 a 4
#> 2 2 b 5
#> 3 3 c 6
df2 <- data.frame(x = 2:4, z = 7:9)
full_join(df, df2, by = "x")
#> Error: Input columns in `x` must be unique.
#> x Problem with `x`.
nest_join(df, df2, by = "x")
#> Error: Input columns in `x` must be unique.
#> x Problem with `x`.
traceback()
#> 7: stop(fallback)
#> 6: signal_abort(cnd)
#> 5: abort(c(glue("Input columns in `{input}` must be unique."), x = glue("Problem with {err_vars(vars[dup])}.")))
#> 4: check_duplicate_vars(x_names, "x")
#> 3: join_cols(tbl_vars(x), tbl_vars(y), by = by, suffix = c("", ""), keep = keep)
#> 2: nest_join.data.frame(df, df2, by = "x")
#> 1: nest_join(df, df2, by = "x")
Here, both functions rely code written in join-cols.R. Both call join_cols() which in turn calls check_duplicate_vars(), which I've copied the source code from:
check_duplicate_vars <- function(vars, input, error_call = caller_env()) {
dup <- duplicated(vars)
if (any(dup)) {
bullets <- c(
glue("Input columns in `{input}` must be unique."),
x = glue("Problem with {err_vars(vars[dup])}.")
)
abort(bullets, call = error_call)
}
}
Although different in syntax from what you wrote, it's designed to provide the same behaviour, and shows it is possible to include in a package and no reason (from my understanding) not to do this. However, I would add a few syntax points based on your code above:
I would bundle the check (if() statement) inside the package with the error raising to reduce repeating yourself in other areas you use the function.
It's often nicer to include the name of the variable or argument passed in so the error message is explicit, such as in the dplyr example above. This makes the error more clear to the user what is causing the problem, in this case, that the x column is not unique in df.
The traceback showing #> Error in e(2) : oof, and error in your example is more obscure to the user, especially as e() is likely not exported in the NAMESPACE and they would need to parse the source code to understand where the error is generated. If you use stop(..., .call = FALSE) or passing the calling environment through the nested functions, like in join-cols.R, then you can avoid not helpful information in the traceback(). This is for instance suggested in Hadley's Advanced R:
By default, the error message includes the call, but this is typically not useful (and recapitulates information that you can easily get from traceback()), so I think it’s good practice to use call. = FALSE

Disable partial name idenfication of function arguments

I am trying to make a function in R that outputs a data frame in a standard way, but that also allows the user to have the personalized columns that he deams necessary (the goal is to make a data format for paleomagnetic data, for which there are common informations that everybody use, and some more unusual that the user might like to keep in the format).
However, I realized that if the user wants the header of his data to be a prefix of one of the defined arguments of the data formating function (e.g. via the 'sheep' argument, that is a prefix of the 'sheepc' argument, see example below), the function interprets it as the defined argument (through partial name identification, see http://adv-r.had.co.nz/Functions.html#lexical-scoping for more details).
Is there a way to prevent this, or to at least give a warning to the user saying that he cannot use this name ?
PS I realize this question is similar to Disabling partial variable names in subsetting data frames, but I would like to avoid toying with the options of the future users of my function.
fun <- function(sheeta = 1, sheetb = 2, sheepc = 3, ...)
{
# I use the sheeta, sheetb and sheepc arguments for computations
# (more complex than shown below, but here thet are just there to give an example)
a <- sum(sheeta, sheetb)
df1 <- data.frame(standard = rep(a, sheepc))
df2 <- as.data.frame(list(...))
if(nrow(df1) == nrow(df2)){
res <- cbind(df1, df2)
return(res)
} else {
stop("Extra elements should be of length ", sheep)
}
}
fun(ball = rep(1,3))
#> standard ball
#> 1 3 1
#> 2 3 1
#> 3 3 1
fun(sheep = rep(1,3))
#> Error in rep(a, sheepc): argument 'times' incorrect
fun(sheet = rep(1,3))
#> Error in fun(sheet = rep(1, 3)) :
#> argument 1 matches multiple formal arguments

From the language definition:
If the formal arguments contain ‘...’ then partial matching is only
applied to arguments that precede it.
fun <- function(..., sheeta = 1, sheetb = 2, sheepc = 3)
{<your function body>}
fun(sheep = rep(1,3))
# standard sheep
#1 3 1
#2 3 1
#3 3 1
Of course, your function should have assertion checks for the non-... parameters (see help("stopifnot")). You could also consider adding a . or _ to their tags to make name collisions less likely.
Edit:
"would it be possible to achieve the same effect without having the ... at the beginning ?"
Yes, here is a quick example with one parameter:
fun <- function(sheepc = 3, ...)
{
stopifnot("partial matching detected" = identical(sys.call(), match.call()))
list(...)
}
fun(sheep = rep(1,3))
# Error in fun(sheep = rep(1, 3)) : partial matching detected
fun(ball = rep(1,3))
#$ball
#[1] 1 1 1

What does builtins(internal = TRUE) return?

From ?builtins:
builtins(TRUE) returns an unsorted list of the names of internal functions, that is those which can be accessed as .Internal(foo(args ...)) for foo in the list.
I don't understand which functions are being returned.
I thought it would be all the closure functions in the base package that call .Internal().
However, the two sets don't match up.
base_objects <- mget(
ls(baseenv(), all.names = TRUE),
envir = baseenv()
)
internals <- names(
Filter(
assertive.types::is_internal_function,
base_objects
)
)
builtins_true <- builtins(internal = TRUE)
c(
both = length(intersect(internals, builtins_true)),
internals_not_builtins_true = length(setdiff(internals, builtins_true)),
builtins_true_not_internals = length(setdiff(builtins_true, internals))
)
## both internals_not_builtins_true builtins_true_not_internals
## 288 125 226
I also thought that it might be the values listed in src/main/names.c in R's source code, and there definitely seems to be some overlap with this, but it isn't exactly this list of values.
What is builtins() doing when you pass internal = TRUE?

Stibu's comment is a specific example of the general problem. ?builtins says that it fetches the names of the objects it returns directly from the symbol table (this is the C symbol table).
And builtins(TRUE) returns all the built-in objects callable via .Internal. That, however, doesn't mean there must be any function that calls .Internal(foo(args, ...)) for any foo.
Stibu gave one example: the internal function may not be called by an R function with the same name, as is the case for many generic functions where the default method calls .Internal.
Another example is something like .addCondHands and .addRestart, which are called by withCallingHandlers and withRestarts, respectively.
It's also possible that one R function calls multiple .Internal functions. I don't know of an example of that off the top of my head though.

After more digging, it seems that the list of functions is everything in the R_FunTab[] object in src/main/names.c where the second digit of the eval column is 1.
Here's a script to retrieve them.
library(stringi)
library(magrittr)
library(dplyr)
names.c <- readLines("https://raw.githubusercontent.com/wch/r-source/56a1b08b7282c5488acb71ee244098f4fd94f7c7/src/main/names.c")
fun_tab <- names.c[92:974] %>%
stri_replace_all_regex("^\\{", "") %>%
stri_replace_all_fixed("{PP", "PP") %>%
stri_replace_all_fixed("}},", "") %>%
stri_replace_all_fixed("\\t", "")
funs <- read.csv(text = fun_tab, header = FALSE, comment.char = "/")
cols <- names.c[86] %>%
stri_sub(4) %>%
stri_split_regex("\\t+") %>%
extract2(1) %>%
stri_trim()
colnames(funs) <- cols
funs$eval <- formatC(funs$eval, width = 3, flag = "0")
# Internal fns have 2nd digit of eval col == 1. See names.c[62:71]
internals <- funs %>% filter_(~ substring(eval, 2, 2) == 1)
I see slight differences when examining
setdiff(internals$printname, builtins(TRUE))
setdiff(builtins(TRUE), internals$printname)
For example builtins(TRUE) doesn't include shell.exec() if you aren't running Windows; mem.limits() was only recently removed from the devel branch of R, so it shows up in builtins(TRUE) for the current release version of R.

Not fully understanding how SE works across the dplyr verbs

I'm trying to understand how SE works in dplyr so I can use variables as inputs to these functions. I'm having some trouble with understanding how this works across the different functions and when I should be doing what. It would be really good to understand the logic behind this.
Here are some examples:
library(dplyr)
library(lazyeval)
a <- c("x", "y", "z")
b <- c(1,2,3)
c <- c(7,8,9)
df <- data.frame(a, b, c)
The following is exactly why i'd use SE and the *_ variant of a function. I want to change the name of what's being mutated based on another variable.
#Normal mutate - copies b into a column called new
mutate(df, new = b)
#Mutate using a variable column names. Use mutate_ and the unqouted variable name. Doesn't use the name "new", but use the string "col.new"
col.name <- "new"
mutate_(df, col.name = "b")
#Do I need to use interp? Doesn't work
expr <- interp(~(val = b), val = col.name)
mutate_(df, expr)
Now I want to filter in the same way. Not sure why my first attempt didn't work.
#Apply the same logic to filter_. the following doesn't return a result
val.to.filter <- "z"
filter_(df, "a" == val.to.filter)
#Do I need to use interp? Works. What's the difference compared to the above?
expr <- interp(~(a == val), val = val.to.filter)
filter_(df, expr)
Now I try to select_. Works as expected
#Apply the same logic to select_, an unqouted variable name works fine
col.to.select <- "b"
select_(df, col.to.select)
Now I move on to rename_. Knowing what worked for mutate and knowing that I had to use interp for filter, I try the following
#Now let's try to rename. Qouted constant, unqouted variable. Doesn't work
new.name <- "NEW"
rename_(df, "a" = new.name)
#Do I need an eval here? It worked for the filter so it's worth a try. Doesn't work 'Error: All arguments to rename must be named.'
expr <- interp(~(a == val), val = new.name)
rename_(df, expr)
Any tips on best practice when it comes to using variable names across the dplyr functions and when interp is required would be great.

The differences here are not related to which dplyr verb you are using. They are related to where you are trying to use the variable. You are mixing whether the variable is used as a function argument or not, and whether it should be interpreted as a name or as a character string.
Scenario 1:
You want to use your variable as an argument name. Such as in your mutate example.
mutate(df, new = b)
Here new is the name of a function argument, it is left of a =. The only way to do this is to use the .dots argument. Like
col.name <- 'new'
mutate_(df, .dots = setNames(list(~b), col.name))
Running just setNames(list(~b), col.name) shows you how we have an expression (~b), which is going right of the =, and the name is going left of the =.
Scenario 2:
You want to give only a variable as a function argument. This is the simplest case. Let's again use mutate(df, new = b), but in this case we want b to be variable. We could use:
v <- 'b'
mutate_(df, .dots = setNames(list(v), 'new'))
Or simply:
mutate_(df, new = b)
Scenario 3
You want to do some combinations of variable and fixed things. That is, your expression should only be partly variable. For this we use interp. For example, what if we would like to do something like:
mutate(df, new = b + 1)
But being able to change b?
v <- 'b'
mutate_(df, new = interp(~var + 1, var = as.name(v)))
Note that we as.name to make sure that we insert b into the expression, not 'b'.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Recode and Mutate_all in dplyr - r

Related

In R, what does parentheses followed by parentheses mean

Separating error message from error condition in package

Disable partial name idenfication of function arguments

What does builtins(internal = TRUE) return?

Not fully understanding how SE works across the dplyr verbs

Categories

Resources