What's the difference between the in and the %in% operator in R? Why do I sometimes need the percentage signs and other times I do not?
The 3 following objects are all functions :
identity
%in%
for
We can call them this way :
`identity`(1)
#> [1] 1
`%in%`(1, 1:2)
#> [1] TRUE
`for`(x, seq(3), print("yes"))
#> [1] "yes"
#> [1] "yes"
#> [1] "yes"
But usually we don't!
"identity" is syntactic (i.e. it's a "regular" name, doesn't contain weird symbols etc), AND it is not a protected word so we can skip the tick marks and call just :
identity(1)
%in% is not syntactic, but it starts and ends with "%", so it can be used in infix form. You could define your own `%fun%` <- function(x, y) ... and use it this way too. So we would call:
1 %in% 1:2
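As a minimal sketch of a user-defined infix operator (the name %plus% is made up for illustration and is not part of base R):

```r
# Hypothetical operator: any function whose name is wrapped in % signs
# can be called in infix form.
`%plus%` <- function(x, y) x + y

infix_result  <- 2 %plus% 3       # infix form
prefix_result <- `%plus%`(2, 3)   # the same call in prefix form
```

Both forms call the same function, so the two results are identical.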
for is a control flow construct, like if, while and repeat. All of those are functions with a given number of arguments, but the language provides more convenient ways to call them than the above. Here we'd do:
for (x in seq(3)) print("yes")
in is just used to parse the code; it's not a function here (just like else isn't either).
?`%in%` will show you what the function does.
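One way to check this, sketched here with base R only, is to capture a loop with quote() and look at its parts. The variable names are illustrative.

```r
# Capture the loop as a call object without running it.
loop <- quote(for (x in seq(3)) print("yes"))

loop_parts <- as.list(loop)           # `for`, x, seq(3), print("yes")
n_parts    <- length(loop_parts)      # four components
has_in     <- "in" %in% all.names(loop)  # is there a name called "in"?

# %in%, by contrast, is an ordinary function object.
in_is_function <- is.function(`%in%`)
```

The parsed call holds the loop variable, the sequence and the body; the word in leaves no trace in it.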
Depending on how you define it, there is no in operator in R, only an %in% operator. Instead, in is “syntactic sugar” as part of the syntax for the for loop.
By contrast, %in% is an actual operator defined in R which tests whether the left-hand expression is contained in the right-hand expression. Like other operators in R, %in% is a regular function and can be called as such:
if (`%in%`(x, seq(3, 5))) message("yes")
… or it can be redefined:
`%in%` = function (x, table) {
message("I redefined %in%!")
match(x, table, nomatch = 0L) > 0L
}
if (5 %in% 1 : 10) message("yes")
# I redefined %in%!
# yes
Usage-wise, I have figured out the answer: I can only use in when I loop through everything, and %in% for checking whether something is contained in something else, e.g.
for (x in seq(3)){
if (x %in% seq(3,5)) print("yes")
}
Related
How to recode vector to NA if it is zero length (numeric(0)), but return the vector unchanged if not? Preferably in tidyverse using pipes.
My attempt:
library(tidyverse)
empty_numeric <- numeric(0)
empty_numeric |>
if_else(length(.) > 0, true = NA, false = . )
#> Error: `condition` must be a logical vector, not a double vector.
You can’t use a vectorised if_else here because its output is the same length as its condition, and length(.) > 0 is always a single element. Instead, you’ll need to use conventional if. Neither will work with the built-in |> pipe, since that restricts how the call can be formed (in particular, it only allows substituting the LHS into top-level arguments, and only once). By contrast, we need to repeat the LHS and substitute it into a nested expression.
Using the ‘magrittr’ pipe operator works, however:
myvec %>% {if (length(.) > 0L) . else NA}
Or, if you prefer writing this using function call syntax:
myvec %>% `if`(length(.) > 0L, ., NA)
To be able to use the native pipe, we need to wrap the logic into a function:
na_if_null = function (x) {
if (length(x) == 0L) as(NA, class(x)) else x
}
myvec |> na_if_null()
(The as cast is there to ensure that the return type is the same as the input type, regardless of the type of x.)
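A quick sketch of what the cast buys you, assuming R 4.1 or later for the native pipe. The function is repeated so the example is self-contained, and empty_types is just an illustrative name.

```r
# Same definition as above, repeated so the sketch runs on its own.
na_if_null <- function(x) {
  if (length(x) == 0L) as(NA, class(x)) else x
}

# The NA that comes back has the same type as the empty input.
empty_types <- c(
  numeric   = typeof(numeric(0)   |> na_if_null()),
  integer   = typeof(integer(0)   |> na_if_null()),
  character = typeof(character(0) |> na_if_null())
)

# Non-empty input is returned unchanged.
unchanged <- c(1, 2) |> na_if_null()
```

Without the cast, every empty input would come back as a logical NA, which can change the type of a column or vector further down the pipeline.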
I have a function whose argument defaults to NA; if it is not NA, it should be a character vector of any length. I have a check to validate this, but is.na produces the standard warning when the vector is a character vector with length greater than 1.
so_function <- function(x = NA) {
if (!(is.na(x) | is.character(x))) {
stop("This was just an example for you SO!")
}
}
so_function(c("A", "B"))
#> Warning in if (!(is.na(x) | is.character(x))) {: the condition has length >
#> 1 and only the first element will be used
An option to prevent the warning I came up with was to use identical:
so_function <- function(x = NA) {
if (!(identical(x, NA) | is.character(x))) {
stop("This was just an example for you SO!")
}
}
My issue here is that this function will generally be taking Excel sheet data loaded into R as inputs, and the NA values generated from that are often NA_character_, NA_integer_, and NA_real_, so identical(x, NA) is often FALSE when I actually need it to be TRUE.
For the broader context, I am experiencing this issue for S3 classes I am creating for a package, and the function below approximates how I am validating multiple attributes for that class, which is when the warnings are appearing. Because of this, I am trying to avoid suppressing warnings as the solution, so would be interested to know what best practice exists to solve this issue.
Edit
In order to make use cases clearer, this is validating attributes for a class, where I want to ensure the attribute is either a single NA value, or a character vector of any length:
so_function(NA_character_) # should pass
so_function(NA_integer_) # should pass
so_function(c(NA, NA)) # should fail
so_function(c("A", "B")) # should pass
so_function(c(1, 2, 3)) # should fail
The length warning comes from the use of if, which expects a length 1 vector, and is.na which is vectorised.
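A small illustration of the length mismatch, using the input from the question:

```r
x <- c("A", "B")

na_flags  <- is.na(x)         # one value per element
char_flag <- is.character(x)  # a single value for the whole vector

# `|` recycles the single value, so the combined condition has length 2.
cond <- na_flags | char_flag
```

Note that from R 4.2.0 onwards, passing a condition of length greater than one to if is an error rather than a warning.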
You could use any or all around the is.na to compress it to a length 1 vector, but there may be edge cases where it doesn't work as you expect, so I would use short-circuit evaluation to check that x has length 1 before the is.na check:
so_function <- function(x = NA) {
if (!((length(x)==1 && is.na(x)) | is.character(x))) {
stop("This was just an example for you SO!")
}
}
so_function(NA_character_) # should pass
so_function(NA_integer_) # should pass
so_function(c(NA, NA)) # should fail
Error in so_function(c(NA, NA)) : This was just an example for you SO!
so_function(c("A", "B")) # should pass
so_function(c(1, 2, 3)) # should fail
Error in so_function(c(1, 2, 3)) : This was just an example for you SO!
Another option is to use NULL as the default value instead.
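A minimal sketch of the NULL-default variant, assuming callers can signal "not supplied" with NULL instead of NA. The name so_function_null is made up here.

```r
# Hypothetical variant: NULL means "not supplied".
so_function_null <- function(x = NULL) {
  # is.null() and is.character() both return a single value,
  # so the condition always has length 1.
  if (!(is.null(x) || is.character(x))) {
    stop("x must be NULL or a character vector")
  }
  invisible(TRUE)
}

default_ok   <- so_function_null()
character_ok <- so_function_null(c("A", "B"))
numeric_fail <- tryCatch(so_function_null(c(1, 2, 3)), error = function(e) e)
```

This changes the contract slightly: a typed NA such as NA_integ_ read from a spreadsheet would have to be converted to NULL (or to NA_character_) before the call, since NA_integer_ itself is neither NULL nor character.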
I don't think the problem arises from is.na() - it is a vectorized function which produces a vector as an output. is.character(x) on the other hand is not vectorized so it only will output a single value.
You can leverage apply-like functions to overcome this e.g.
sapply(c("a", NA, 5), is.character)
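One caveat with this call: c() coerces its arguments to a common type, so every element is already a character string before sapply sees it. A short sketch of the difference between a vector and a list (variable names are illustrative):

```r
vec_input  <- c("a", NA, 5)     # coerced to a character vector
list_input <- list("a", NA, 5)  # each element keeps its own type

vec_check  <- unname(sapply(vec_input, is.character))
list_check <- sapply(list_input, is.character)
```

Only the list version distinguishes the elements by type; for an atomic vector, a single is.character(x) call already tells you everything.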
if is likewise not vectorised: it expects a single value. You are better off using ifelse for element-wise comparison.
I don't think I quite grasped what you want to do with your function, but it could be rewritten like this:
so_function_2 <- function(x = NA) {
condit <- !(is.na(x) | sapply(x, is.character))
ifelse(condit, "This was just an example for you SO!", "FALSE")
}
I want to create an overloaded function that behaves differently given the arguments provided. For this, I need to check if the argument given is an existing object (e.g. data frame, list, integer) or an abstract formula (e.g. a + b, 2 * 4, y ~ x + y etc.). Below I paste what I would like it to recognize:
df <- data.frame(a, b)
f(df) # data.frame
f(data.frame(a, b)) # data frame
f(a + b) # expression
f("a + b") # character
f(2 * 2 + 7) # expression
f(I(2 * 2)) # integer
Is it possible to construct such a function? How? Unfortunately I wasn't able to find any references on the web or in the books on R programming I know.
The general way of overloading functions in R would be something like this:
f <- function(x) UseMethod("f")
f.default <- function(x) eval(substitute(x))
f.data.frame <- function(x) print("data frame")
It gives:
> f(df)
[1] "data frame"
> f(2 + 2)
[1] 4
> f(list(a, b))
[[1]]
[1] 1
[[2]]
[1] 2
So the problem with doing it like this is that I would have to name all the possible other data types rather than checking if x is an expression.
The same is with using:
f2 <- function(x) typeof(substitute(x))
because it evaluates function calls and expressions in the same manner:
> f2(2 + 2)
[1] "language"
> f2(df)
[1] "symbol"
> f2(data.frame(a, b))
[1] "language"
while I would like it to differentiate between list(a, b) and 2 + 2, because the first one is a list, and the second one is an expression.
I know that it would be easy with a classic R formula that is easily recognizable by R, but is it possible with different input?
Thanks!
This is the principle of object-oriented programming in R. You can learn more about it here:
https://www.stat.auckland.ac.nz/~stat782/downloads/08-Objects.pdf
http://brainimaging.waisman.wisc.edu/~perlman/R/A1%20Introduction%20to%20object-oriented%20programming.pdf
There are two types of objects in R: S3 and S4. S3 objects are easier to implement and more flexible. Their use is sufficient for what you want to do. You can use S3 generic functions.
I strongly advise you to learn more about these S3 and S4 classes, but to make it short, you can just look at the class of parameter you give to function f. This can be done thanks to function class.
You can separate your function f in different cases:
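f <- function(a){
  # inherits() is safer than class(a) == "...", because class() can
  # return more than one value (e.g. for tibbles)
  if (inherits(a, 'data.frame')){
    # do things...
  }
  else if (inherits(a, 'formula')){
    # do things...
  }
  else if (is.integer(a)){
    # do things...
  }
  else {
    stop("Class not supported")
  }
}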
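The same branching can be written with S3 dispatch instead of an if chain. This is only a sketch; the generic name describe is invented for the example.

```r
# Hypothetical generic: dispatches on the class of its argument.
describe <- function(x) UseMethod("describe")

describe.data.frame <- function(x) "data frame"
describe.formula    <- function(x) "formula"
describe.default    <- function(x) paste("other:", class(x)[1])

res_df      <- describe(data.frame(a = 1))
res_formula <- describe(y ~ x)
res_int     <- describe(1L)
```

Dispatch happens on the evaluated argument, so an unevaluated expression such as a + b cannot be recognised this way: it is computed before any method is chosen.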
OK, it seems I tried to complicate it to a greater extent than I had to. The simple answer is just:
if (tryCatch(is.data.frame(x), error=function(z) FALSE)) {
# here do stuff with a data.frame
} else {
# here check the expression using some regular expressions etc.
}
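One possible way to flesh this out is to combine substitute(), which shows how the argument was written, with tryCatch(), which shows what it evaluates to. The function name classify_arg and the returned labels are assumptions made for this sketch, not part of the original answer.

```r
# Hypothetical helper: looks at both the unevaluated argument and its value.
classify_arg <- function(x) {
  expr  <- substitute(x)   # the argument exactly as the caller wrote it
  value <- tryCatch(eval(expr, parent.frame()), error = function(e) NULL)
  if (is.data.frame(value)) {
    "data.frame"
  } else if (is.character(value)) {
    "character"
  } else if (is.call(expr)) {
    "expression"           # written as a call, e.g. a + b or 2 * 2 + 7
  } else {
    class(value)[1]
  }
}

my_df <- data.frame(a = 1:2, b = 3:4)
```

With this, classify_arg(my_df) and classify_arg(data.frame(a = 1)) are both reported as data frames, while classify_arg(2 * 2 + 7) is reported as an expression. It does not cover the I(2 * 2) case from the question, which would also be reported as an expression.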
Tracemem is doing what I need it to, but it is also producing distracting visual clutter. Here is a simple example.
a<-1
b<-2
dummyfunction<-function(x,y){return(sum(x,y))}
dummyfunction(a,b)
[1] 3
Now, I want to do something more complex, first tracemem to see if the inputs are duplicated...
dummyfunction2<-function(x,y){if (tracemem(x)==tracemem(y)){return("Input vectors are identical")}
if(sum(x %in% y)>=length(x) & sum(y %in% x)>=length(y)){print("Something something.")}
return(sum(x,y))}
This does what I want if the inputs are duplicated...
dummyfunction2(a,a)
[1] "Input vectors are identical"
When they're not duplicated, though the function still works, it spews a bunch of confusing information.
dummyfunction2(a,b)
tracemem[0x0000000009824470 -> 0x000000000a7ced80]: match %in% dummyfunction2
tracemem[0x0000000009824500 -> 0x000000000a7cedb0]: match %in% dummyfunction2
tracemem[0x0000000009824500 -> 0x000000000a7cef90]: match %in% dummyfunction2
tracemem[0x0000000009824470 -> 0x000000000a7cc1a8]: match %in% dummyfunction2
[1] 3
I'm hoping to convince non-R users to try using a function with this issue, and output like this will certainly scare them off.
What is the most elegant way to remove this visual clutter without suppressing potentially informative warnings, etc. that may crop up in other portions of the function?
From http://stat.ethz.ch/R-manual/R-patched/library/base/html/tracemem.html :
"This function marks an object so that a message is printed whenever the internal code copies the object."
You could stick untracemem into the function to get around it:
dummyfunction3<-function(x,y){
if (tracemem(x)==tracemem(y)){return("Input vectors are identical")}
untracemem(x)
untracemem(y)
if(sum(x %in% y)>=length(x) & sum(y %in% x)>=length(y)){print("Something something.")}
return(sum(x,y))}
output:
a <- 1
b <- 2
dummyfunction3(a,a)
# [1] "Input vectors are identical"
dummyfunction3(a,b)
# [1] 3
Don't use tracemem(). Instead you could try pryr::address() which just returns the memory address of the input.
devtools::install_github("hadley/pryr")
library(pryr)
x <- 1:10
y <- x
address(x)
## [1] "0x100a568c8"
address(y)
## [1] "0x100a568c8"
I'm trying to find the names of all the functions used in an arbitrary legal R expression, but I can't find a function that will flag the below example as a function instead of a name.
test <- expression(
this_is_a_function <- function(var1, var2){
this_is_a_function(var1-1, var2)
})
all.vars(test, functions = FALSE)
[1] "this_is_a_function" "var1" "var2"
all.vars(expr, functions = FALSE) seems to return functions declarations (f <- function(){}) in the expression, while filtering out function calls ('+'(1,2), ...).
Is there any function - in the core libraries or elsewhere - that will flag 'this_is_a_function' as a function, not a name? It needs to work on arbitrary expressions, that are syntactically legal but might not evaluate correctly (e.g '+'(1, 'duck'))
I've found similar questions, but they don't seem to contain the solution.
If clarification is needed, leave a comment below. I'm using the parser package to parse the expressions.
Edit: @Hadley
I have expressions which contain entire scripts, which usually consist of a main function containing nested function definitions, with a call to the main function at the end of the script.
Functions are all defined inside the expressions, and I don't mind if I have to include '<-' and '{', since I can easily filter them out myself.
The motivation is to take all my R scripts and gather basic statistics about how my use of functions has changed over time.
Edit: Current Solution
A Regex-based approach grabs the function definitions, combined with the method in James' comment to grab function calls. Usually works, since I never use right-hand assignment.
function_usage <- function(code_string){
  # takes a script, extracts the names of the functions it defines
  require(stringr)
  code_string <- str_replace(code_string, 'expression\\(', '')
  arrow_assign <- '.+[ \n]+<-[ \n]+function'  # name <- function
  equal_assign <- '.+[ \n]+=[ \n]+function'   # name = function
  function_names <- sapply(
    strsplit(
      str_match(code_string, arrow_assign), split = '[ \n]+<-'),
    function(x) x[1])
  function_names <- c(function_names, sapply(
    strsplit(
      str_match(code_string, equal_assign), split = '[ \n]+='),
    function(x) x[1]))
  return(table(function_names))
}
Short answer: is.function checks whether a variable actually holds a function. This does not work on (unevaluated) calls because they are calls. You also need to take care of masking:
mean <- mean (x)
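A short sketch of what masking does here, run at the top level of a fresh session (the result names are illustrative):

```r
x <- c(1, 2, 3)
mean <- mean(x)   # a number called `mean` now masks base::mean

masked_is_function <- is.function(mean)        # checks the number, not the function
base_is_function   <- is.function(base::mean)

# In call position R skips non-function objects when looking up the name,
# so the call still reaches base::mean.
call_still_works <- mean(c(4, 6))

rm(mean)          # remove the mask again
```

So is.function() on a bare name reports on whatever object is found first, which is not necessarily what a call with that name would use.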
Longer answer:
IMHO there is a big difference between the two occurrences of this_is_a_function.
In the first case you'll assign a function to the variable with name this_is_a_function once you evaluate the expression. The difference is the same difference as between 2+2 and 4.
However, just finding <- function () does not guarantee that the result is a function:
f <- (function (x) {x + 1}) (2)
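To make the point concrete, here is a small comparison. The parentheses around the anonymous function matter: without them, R parses the trailing (2) as part of the function body. The names f_def and f_value are illustrative.

```r
f_def   <- function(x) {x + 1}         # assigns a function
f_value <- (function(x) {x + 1})(2)    # calls the function, assigns the result

def_is_function   <- is.function(f_def)
value_is_function <- is.function(f_value)
```

Both lines contain the text "<- function" or "<- (function", yet only the first leaves a function behind.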
The second occurrence is syntactically a function call. You can determine from the expression that a variable called this_is_a_function which holds a function needs to exist in order for the call to evaluate properly. BUT: you don't know whether it exists from that statement alone. However, you can check whether such a variable exists, and whether it is a function.
The fact that functions are stored in variables like other types of data, too, means that in the first case you can know that the result of function () will be a function and from that conclude that immediately after this expression is evaluated, the variable with name this_is_a_function will hold a function.
However, R is full of names and functions: "->" is the name of the assignment function (a variable holding the assignment function) ...
After evaluating the expression, you can verify this by is.function (this_is_a_function).
However, this is by no means the only expression that returns a function: Think of
f <- function () {g <- function (){}}
> body (f)[[2]][[3]]
function() {
}
> class (body (f)[[2]][[3]])
[1] "call"
> class (eval (body (f)[[2]][[3]]))
[1] "function"
all.vars(expr, functions = FALSE) seems to return functions declarations (f <- function(){}) in the expression, while filtering out function calls ('+'(1,2), ...).
I'd say it is the other way round: in that expression f is the variable (name) which will be assigned the function (once the call is evaluated). + (1, 2) evaluates to a numeric. Unless you keep it from doing so.
> e <- expression (1 + 2)
> e [[1]]
1 + 2
> e [[1]][[1]]
`+`
> class (e [[1]][[1]])
[1] "name"
> eval (e [[1]][[1]])
function (e1, e2) .Primitive("+")
> class (eval (e [[1]][[1]]))
[1] "function"
Instead of looking for function definitions, which is going to be effectively impossible to do correctly without actually evaluating the functions, it will be easier to look for function calls.
The following function recursively spiders the expression/call tree returning the names of all objects that are called like a function:
find_calls <- function(x) {
# Base case
if (!is.recursive(x)) return()
recurse <- function(x) {
sort(unique(as.character(unlist(lapply(x, find_calls)))))
}
if (is.call(x)) {
f_name <- as.character(x[[1]])
c(f_name, recurse(x[-1]))
} else {
recurse(x)
}
}
It works as expected for a simple test case:
x <- expression({
f(3, g())
h <- function(x, y) {
i()
j()
k(l())
}
})
find_calls(x)
# [1] "{" "<-" "f" "function" "g" "i" "j"
# [8] "k" "l"
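If the code is available as text rather than as an expression object, another option is to let the parser label the tokens. This is a sketch using utils::getParseData(); the script in code is made up for the example.

```r
# A small made-up script, kept as text.
code <- "main <- function(n) {\n  helper(n) + sum(n)\n}\nmain(3)"

# keep.source = TRUE is needed for getParseData() to have token data.
parsed <- parse(text = code, keep.source = TRUE)
tokens <- utils::getParseData(parsed)

# The parser tags every name in call position as SYMBOL_FUNCTION_CALL.
called      <- tokens$text[tokens$token == "SYMBOL_FUNCTION_CALL"]
call_counts <- table(called)
```

Operators such as <- and + get their own token types, so they are not included here. That suits the goal of counting named function usage across scripts, but it differs from find_calls, which reports them.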
Just to follow up here as I have also been dealing with this problem: I have now created a C-level function to do this using code very similar to the C implementation of all.names and all.vars in base R. It however only works with objects of type "language" i.e. function calls, not type "expression". Demonstration:
ex = quote(sum(x) + mean(y) / z)
all.names(ex)
#> [1] "+" "sum" "x" "/" "mean" "y" "z"
all.vars(ex)
#> [1] "x" "y" "z"
collapse::all_funs(ex)
#> [1] "+" "sum" "/" "mean"
Created on 2022-08-17 by the reprex package (v2.0.1)
This generalizes to arbitrarily complex nested calls.
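For comparison, a rough base-R approximation is to take the names reported by all.names() and drop those reported by all.vars(). This is a sketch, not an equivalent of the C-level function: a name that is used both as a variable and as a function in the same expression would be dropped.

```r
ex <- quote(sum(x) + mean(y) / z)

# Names that appear in the call tree but are not plain variables.
called_names <- setdiff(all.names(ex), all.vars(ex))
```

On this example the result matches the all_funs output shown above.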