How does R interpret parentheses? Like most other programming languages these are built-in operators, and I normally use them without thinking.
However, I came across this example. Let's say we have a data.table in R, and I would like to apply a function on it's columns. Then I might write:
dt <- data.table(my_data)
important_cols <- c("col1", "col2", "col5")
dt[, (important_cols) := lapply(.SD, my_func), .SDcols = important_cols]
Obviously I can't neglect the parentheses:
dt[, important_cols := lapply(.SD, my_func), .SDcols = important_cols]
as that would introduce a new object called important_cols to my data.table, instead of modifying my existing columns in place.
My question is, why does putting ( ) around the vector "expand" it?
This question can probably better phrased and titled. But then I would have probably found the answer by Googling if I knew the terminology to employ while asking it, hence I'm here.
While we're on that topic, if someone could point out the differences between [ ], { }, etc., and how they should be used, that would be appreciated too :)
A special feature of R (compared to e.g. C++) is that the various parentheses are actually functions. What this means is that (a) and a are different expressions. The second is just a, while the first is the function ( called with an argument a. Here are a few expressions trees for you to compare:
as.list(substitute( a ))
#[[1]]
#a
as.list(substitute( (a) ))
#[[1]]
#`(`
#
#[[2]]
#a
as.list(substitute( sqrt(a) ))
#[[1]]
#sqrt
#
#[[2]]
#a
Notice how similar the last trees are - in one the function is sqrt, in the other it's "(". In most places in R, the "(" function doesn't do anything, it just returns the same expression, but in the particular case of data.table, it is "overridden" (in quotes because that's not exactly how it's done, but in spirit it is) to do a variety of useful operations.
And here's one more demo to hopefully cement the point:
`(` = function(x) x*x
2
#[1] 2
(2)
#[1] 4
((2))
#[1] 16
Related
I'm trying to get the output "1*2*4*5" from (function(x) Reduce(paste0(toString("*")),x))(c(1,2,4,5)), but no matter how I manipulate Reduce, paste0, and the asterisks, I'm either getting error messages or the asterisks being treated as multiplication (giving 40). Where am I going wrong?
Reduce uses a function with two arguments to which it applies the previous result and the next element of the vector. Therefore, you need a function of both x and y:
Reduce(function(x,y)paste0(x,"*",y),c(1,2,4,5))
#[1] "1*2*4*5"
As an aside, you can provide an initial value to be applied as x for the first element of the vector with init =.
Reduce(function(x,y)paste0(x,"*",y),c(1,2,4,5), init = 0)
#[1] "0*1*2*4*5"
One thing you may have tried was this:
Reduce(paste0("*"),c(1,2,4,5))
#[1] 40
This applies the multiplication operator to x and y, because paste0("*") evaluates to "*".
Another base R option is to use paste within gsub, e.g.,
x <- 1:5
gsub("\\s","*",Reduce(paste,x))
which gives
> gsub("\\s","*",Reduce(paste,x))
[1] "1*2*3*4*5"
KISS method:
(with improvements as suggested by #nicola)
bar <- as.character(1:5)
paste0(bar,sep="",collapse='*')
#[1] "1*2*3*4*5"
The documentation says
vapply is similar to sapply, but has a pre-specified type of return value, so it can be safer [...] to use.
Could you please elaborate as to why it is generally safer, maybe providing examples?
P.S.: I know the answer and I already tend to avoid sapply. I just wish there was a nice answer here on SO so I can point my coworkers to it. Please, no "read the manual" answer.
As has already been noted, vapply does two things:
Slight speed improvement
Improves consistency by providing limited return type checks.
The second point is the greater advantage, as it helps catch errors before they happen and leads to more robust code. This return value checking could be done separately by using sapply followed by stopifnot to make sure that the return values are consistent with what you expected, but vapply is a little easier (if more limited, since custom error checking code could check for values within bounds, etc.).
Here's an example of vapply ensuring your result is as expected. This parallels something I was just working on while PDF scraping, where findD would use a regex to match a pattern in raw text data (e.g. I'd have a list that was split by entity, and a regex to match addresses within each entity. Occasionally the PDF had been converted out-of-order and there would be two addresses for an entity, which caused badness).
> input1 <- list( letters[1:5], letters[3:12], letters[c(5,2,4,7,1)] )
> input2 <- list( letters[1:5], letters[3:12], letters[c(2,5,4,7,15,4)] )
> findD <- function(x) x[x=="d"]
> sapply(input1, findD )
[1] "d" "d" "d"
> sapply(input2, findD )
[[1]]
[1] "d"
[[2]]
[1] "d"
[[3]]
[1] "d" "d"
> vapply(input1, findD, "" )
[1] "d" "d" "d"
> vapply(input2, findD, "" )
Error in vapply(input2, findD, "") : values must be length 1,
but FUN(X[[3]]) result is length 2
Because two there are two d's in the third element of input2, vapply produces an error. But sapply changes the class of the output from a character vector to a list, which could break code downstream.
As I tell my students, part of becoming a programmer is changing your mindset from "errors are annoying" to "errors are my friend."
Zero length inputs
One related point is that if the input length is zero, sapply will always return an empty list, regardless of the input type. Compare:
sapply(1:5, identity)
## [1] 1 2 3 4 5
sapply(integer(), identity)
## list()
vapply(1:5, identity, integer(1))
## [1] 1 2 3 4 5
vapply(integer(), identity, integer(1))
## integer(0)
With vapply, you are guaranteed to have a particular type of output, so you don't need to write extra checks for zero length inputs.
Benchmarks
vapply can be a bit faster because it already knows what format it should be expecting the results in.
input1.long <- rep(input1,10000)
library(microbenchmark)
m <- microbenchmark(
sapply(input1.long, findD ),
vapply(input1.long, findD, "" )
)
library(ggplot2)
library(taRifx) # autoplot.microbenchmark is moving to the microbenchmark package in the next release so this should be unnecessary soon
autoplot(m)
The extra key strokes involved with vapply could save you time debugging confusing results later. If the function you're calling can return different datatypes, vapply should certainly be used.
One example that comes to mind would be sqlQuery in the RODBC package. If there's an error executing a query, this function returns a character vector with the message. So, for example, say you're trying to iterate over a vector of table names tnames and select the max value from the numeric column 'NumCol' in each table with:
sapply(tnames,
function(tname) sqlQuery(cnxn, paste("SELECT MAX(NumCol) FROM", tname))[[1]])
If all the table names are valid, this would result in a numeric vector. But if one of the table names happens to change in the database and the query fails, the results are going to be coerced into mode character. Using vapply with FUN.VALUE=numeric(1), however, will stop the error here and prevent it from popping up somewhere down the line---or worse, not at all.
If you always want your result to be something in particular...e.g. a logical vector. vapply makes sure this happens but sapply does not necessarily do so.
a<-vapply(NULL, is.factor, FUN.VALUE=logical(1))
b<-sapply(NULL, is.factor)
is.logical(a)
is.logical(b)
I'm reading the advanced R introduction by Hadley Wickham, where he states that [ (and +, -, {, etc) are functions, so that [ can be used in this manner
> x <- list(1:3, 4:9, 10:12)
> sapply(x, "[", 2)
[1] 2 5 11
Which is perfectly fine and understandable. But if [ is the function required to subset, does ] have another use rather than a syntactical one?
I found that:
> `]`
Error: object ']' not found
so I assume there is no other use for it?
This is the fundamental difference between syntax and semantics. Semantics require that — in R — things like subsetting and if etc are functions. That’s why R defines functions `[`, `if` etc.
And then there’s syntax. And R’s syntax dictates that the syntax for if is either if (condition) expression or if (condition) expression else expression. Likewise, the syntax for subsetting in R is obj[args…]. That is, ] is simply a syntactic element and it has no semantic equivalent, no corresponding function (same as else).
To make this perhaps even clearer:
[ and ] are syntactic elements in R that delimit a subsetting expression.
By contrast, `[` (note the backticks!) is a function that implements the subsetting operation.
Somehow though, I was expecting ] to be a syntactical element, by default: indexing from the end. So I define it myself in my code:
"]" <- function(x,y) if (y <= length(x)) x[length(x)+1-y] else NA
With the given example, then:
sapply(x, "]", 1)
[1] 3 9 12
sapply(x, "]", 2)
[1] 2 8 11
I have come across the popular data.table package and one thing in particular intrigued me. It has an in-place assignment operator
:=
This is not defined in base R. In fact if you didn't load the data.table package, it would have raised an error if you had tried to used it (e.g., a := 2) with the message:
Error: could not find function ":="
Also, why does := work? Why does R let you define := as infix operator while every other infix function has to be surrounded by %%, e.g.
`:=` <- function(a, b) {
paste(a,b)
}
"abc" := "def"
Clearly it's not meant to be an alternative syntax to %function.name% for defining infix functions. Is data.table exploiting some parsing quirks of R? Is it a hack? Will it be "patched" in the future?
It is something that the base R parser recognizes and seems to parse as a left assign (at least in terms or order of operations and such). See the C source code for more details.
as.list(parse(text="a:=3")[[1]])
# [[1]]
# `:=`
#
# [[2]]
# a
#
# [[3]]
# [1] 3
As far as I can tell it's undocumented (as far as base R is concerned). But it is a function/operator you can change the behavior of
`:=`<-function(a,b) {a+b}
3 := 7
# [1] 10
As you can see there really isn't anything special about the ":" part itself. It just happens to be the start of a compound token.
It's not just a colon operator but rather := is a single operator formed by the colon and equal sign (just as the combination of "<" and "-" forms the assignment operator in base R). The := operator is an infix function that is defined to be part of the evaluation of the "j" argument inside the [.data.table function. It creates or assigns a value to a column designated by its LHS argument using the result of evaluating its RHS.
I tried to do this simple search but couldn't find anything on the percent (%) symbol in R.
What does %in% mean in the following code?
time(x) %in% time(y) where x and y are matrices.
How do I look up help on %in% and similar functions that follow the %stuff% pattern, as I cannot locate the help file?
Related questions:
What does eg %+% do? in R
The R %*% operator
What does %*% mean in R
What does %||% do in R?
What does %>% mean in R
I didn't think GSee's or Sathish's answers went far enough because "%" does have meaning all by itself and not just in the context of the %in% operator. It is the mechanism for defining new infix operators by users. It is a much more general issue than the virtues of the %in% infix operator or its more general prefix ancestor match. It could be as simple as making a pairwise "s"(um) operator:
`%s%` <- function(x,y) x + y
Or it could be more interesting, say making a second derivative operator:
`%DD%` <- function(expr, nam="x") { D(D( bquote(.(expr)), nam), nam) }
expression(x^4) %DD% "x"
# 4 * (3 * x^2)
The %-character also has importance in the parsing of Date, date-time, and C-type format functions like strptime, formatC and sprintf.
Since that was originally written we have seen the emergence of the magrittr package with the dplyr elaboration that demonstrates yet another use for %-flanked operators.
So the most general answer is that % symbols are handled specially by the R parser. Since the parser is used to process plotmath expressions, you will also see extensive options for graphics annotations at the ?plotmath help page.
%op% denotes an infix binary operator. There are several built-in operators using %, and you can also create your own.
(A single % sign isn't a keyword in R. You can see a list of keywords on the ?Reserved help page.)
How do I get help on binary operators?
As with anything that isn't a standard variable name, you have to to enclose the term in quotes or backquotes.
?"%in%"
?`%in%`
Credit: GSee's answer.
What does %in% do?
As described on the ?`%in%` help page (which is actually the ?match help page since %in% is really only an infix version of match.),
[%in%] returns a logical vector indicating if there is a match or not for its left operand
It is most commonly used with categorical variables, though it can be used with numbers as well.
c("a", "A") %in% letters
## [1] TRUE FALSE
1:4 %in% c(2, 3, 5, 7, 11)
## [1] FALSE TRUE TRUE FALSE
Credit: GSee's answer, Ari's answer, Sathish's answer.
How do I create my own infix binary operators?
These are functions, and can be defined in the same way as any other function, with a couple of restrictions.
It's a binary opertor, so the function must take exactly two arguments.
Since the name is non-standard, it must be written with quotes or backquotes.
For example, this defines a matrix power operator.
`%^%` <- function(x, y) matrixcalc::matrix.power(x, y)
matrix(1:4, 2) %^% 3
Credit: BondedDust's answer, Ari's answer.
What other % operators are there?
In base R:
%/% and %% perform integer division and modular division respectively, and are described on the ?Arithmetic help page.
%o% gives the outer product of arrays.
%*% performs matrix multiplication.
%x% performs the Kronecker product of arrays.
In ggplot2:
%+% replaces the data frame in a ggplot.
%+replace% modifies theme elements in a ggplot.
%inside% (internal) checks for values in a range.
%||% (internal) provides a default value in case of NULL values. This function also appears internally in devtools, reshape2, roxygen2 and knitr. (In knitr it is called %n%.)
In magrittr:
%>% pipes the left-hand side into an expression on the right-hand side.
%<>% pipes the left-hand side into an expression on the right-hand side, and then assigns the result back into the left-hand side object.
%T>% pipes the left-hand side into an expression on the right-hand side, which it uses only for its side effects, returning the left-hand side.
%,% builds a functional sequence.
%$% exposes columns of a data.frame or members of a list.
In data.table:
%between% checks for values in a range.
%chin% is like %in%, optimised for character vectors.
%like% checks for regular expression matches.
In Hmisc:
%nin% returns the opposite of %in%.
In devtools:
%:::% (internal) gets a variable from a namespace passed as a string.
In sp:
%over% performs a spatial join (e.g., which polygon corresponds to some points?)
In rebus:
%R% concatenates elements of a regex object.
More generally, you can find all the operators in all the packages installed on your machine using:
library(magrittr)
ip <- installed.packages() %>% rownames
(ops <- setNames(ip, ip) %>%
lapply(
function(pkg)
{
rdx_file <- system.file("R", paste0(pkg, ".rdx"), package = pkg)
if(file.exists(rdx_file))
{
rdx <- readRDS(rdx_file)
fn_names <- names(rdx$variables)
fn_names[grepl("^%", fn_names)]
}
}
) %>%
unlist
)
Put quotes around it to find the help page. Either of these work
> help("%in%")
> ?"%in%"
Once you get to the help page, you'll see that
‘%in%’ is currently defined as
‘"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0’
Since time is a generic, I don't know what time(X2) returns without knowing what X2 is. But, %in% tells you which items from the left hand side are also in the right hand side.
> c(1:5) %in% c(3:8)
[1] FALSE FALSE TRUE TRUE TRUE
See also, intersect
> intersect(c(1:5), c(3:8))
[1] 3 4 5
More generally, %foo% is the syntax for a binary operator. Binary operators in R are really just functions in disguise, and take two arguments (the one before and the one after the operator become the first two arguments of the function).
For example:
> `%in%`(1:5,4:6)
[1] FALSE FALSE FALSE TRUE TRUE
While %in% is defined in base R, you can also define your own binary function:
`%hi%` <- function(x,y) cat(x,y,"\n")
> "oh" %hi% "my"
oh my
%in% is an operator used to find and subset multiple occurrences of the same name or value in a matrix or data frame.
For example 1: subsetting with the same name
set.seed(133)
x <- runif(5)
names(x) <- letters[1:5]
x[c("a", "d")]
# a d
# 0.5360112 0.4231022
Now you change the name of "d" to "a"
names(x)[4] <- "a"
If you try to extract the similar names and its values using the previous subscript, it will not work. Notice the result, it does not have the elements of [1] and [4].
x[c("a", "a")]
# a a
# 0.5360112 0.5360112
So, you can extract the two "a"s from different position in a variable by using %in% binary operator.
names(x) %in% "a"
# [1] TRUE FALSE FALSE TRUE FALSE
#assign it to a variable called "vec"
vec <- names(x) %in% "a"
#extract the values of two "a"s
x[vec]
# a a
# 0.5360112 0.4231022
Example 2: Subsetting multiple values from a column
Refer this site for an example