I am aware of at least two distinct uses of the equals sign in the R-programming language:
(1) as a deprecated assignment operator, i.e. x = 3 instead of x <- 3.
(2) for passing values of arguments to functions, e.g. ggplot(df, aes(x = length, y = width))
Do either of these operators correspond to symmetric relations (in the sense of mathematics)?
The 'equals' operator == does (I think), which is why it corresponds most closely to the use of the equals sign in mathematics (which is always a symmetric relation).
But for example if one tried to run ggplot(df, aes(length = x, width = y)) one would get an error, and one would also get an error trying to run 3 = x.
Thus, is it true that, unlike in mathematics, the equals sign in R is not a symmetric relation? Is that why <- is preferred by some for assignment, because it better conveys the lack of symmetry?
Bonus question: are there other programming languages where the equals sign does not correspond to a symmetric relation? PowerShell (I have never heard of it before) might be one.
The = operator is not symmetric in R. When it comes to assignment, = is basically a function that takes a symbol and a value and assigns that value to that symbol. When it comes to named parameters, it's really just part of the syntax of naming a parameter.
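As a rough illustration of the asymmetry (a minimal sketch; the variable names x, y and swapped are arbitrary and not from the question), compare == with = used for assignment:

```r
x <- 3

# `==` is symmetric: swapping the operands gives the same answer.
identical(x == 3, 3 == x)

# `=` used for assignment is an ordinary function call in disguise.
`=`(y, 3)   # same effect as y = 3

# Swapping the operands of an assignment fails, because 3 is not a
# valid assignment target. try() captures the error instead of stopping.
swapped <- try(eval(parse(text = "3 = x")), silent = TRUE)
inherits(swapped, "try-error")
```

The point is only that the left-hand side of = must be something that can be assigned to, so the two sides are not interchangeable.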
<- is preferred for assignment simply because it has an unambiguous meaning.
What are the differences between the assignment operators = and <- in R?
I know that operators are slightly different, as this example shows
x <- y <- 5
x = y = 5
x = y <- 5
x <- y = 5
# Error in (x <- y) = 5 : could not find function "<-<-"
But is this the only difference?
The difference in assignment operators is clearer when you use them to set an argument value in a function call. For example:
median(x = 1:10)
x
## Error: object 'x' not found
In this case, x is declared within the scope of the function, so it does not exist in the user workspace.
median(x <- 1:10)
x
## [1] 1 2 3 4 5 6 7 8 9 10
In this case, x is declared in the user workspace, so you can use it after the function call has been completed.
There is a general preference among the R community for using <- for assignment (other than in function signatures) for compatibility with (very) old versions of S-Plus. Note that the spaces help to clarify situations like
x<-3
# Does this mean assignment?
x <- 3
# Or less than?
x < -3
Most R IDEs have keyboard shortcuts to make <- easier to type. Ctrl + = in Architect, Alt + - in RStudio (Option + - under macOS), Shift + - (underscore) in emacs+ESS.
If you prefer writing = to <- but want to use the more common assignment symbol for publicly released code (on CRAN, for example), then you can use one of the tidy_* functions in the formatR package to automatically replace = with <-.
library(formatR)
tidy_source(text = "x=1:5", arrow = TRUE)
## x <- 1:5
The answer to the question "Why does x <- y = 5 throw an error but not x <- y <- 5?" is "It's down to the magic contained in the parser". R's syntax contains many ambiguous cases that have to be resolved one way or another. The parser chooses to resolve the bits of the expression in different orders depending on whether = or <- was used.
To understand what is happening, you need to know that assignment silently returns the value that was assigned. You can see that more clearly by explicitly printing, for example print(x <- 2 + 3).
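A small sketch of that invisible return value, using base R's withVisible() (the names res, x and y are just for illustration):

```r
# withVisible() reports both the value of an expression and whether
# R would have printed it at the prompt.
res <- withVisible(x <- 2 + 3)
res$value    # the assigned value
res$visible  # FALSE: the value is returned invisibly

# Because the value is returned, an assignment can be nested inside
# a larger expression.
y <- (x <- 2 + 3) * 2
```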
Secondly, it's clearer if we use prefix notation for assignment. So
x <- 5
`<-`(x, 5) #same thing
y = 5
`=`(y, 5) #also the same thing
The parser interprets x <- y <- 5 as
`<-`(x, `<-`(y, 5))
We might expect that x <- y = 5 would then be
`<-`(x, `=`(y, 5))
but actually it gets interpreted as
`=`(`<-`(x, y), 5)
This is because = is lower precedence than <-, as shown on the ?Syntax help page.
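One way to check this grouping yourself, sketched here with parse() so that nothing is actually evaluated (e1 and e2 are illustrative names):

```r
# Parse without evaluating, then inspect the outermost call.
e1 <- parse(text = "x <- y <- 5")[[1]]
e2 <- parse(text = "x <- y = 5")[[1]]

as.character(e1[[1]])  # outermost function in e1
deparse(e1[[3]])       # its second argument is itself an assignment

as.character(e2[[1]])  # outermost function in e2
deparse(e2[[2]])       # its first argument is the call x <- y
```

In the second case the outermost call is =, with x <- y as its target, which is why evaluation fails.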
What are the differences between the assignment operators = and <- in R?
As your example shows, = and <- have slightly different operator precedence (which determines the order of evaluation when they are mixed in the same expression). In fact, ?Syntax in R gives the following operator precedence table, from highest to lowest:
…
‘-> ->>’ rightwards assignment
‘<- <<-’ assignment (right to left)
‘=’ assignment (right to left)
…
But is this the only difference?
Since you were asking about the assignment operators: yes, that is the only difference. However, you would be forgiven for believing otherwise. Even the R documentation of ?assignOps claims that there are more differences:
The operator <- can be used anywhere,
whereas the operator = is only allowed at the top level (e.g.,
in the complete expression typed at the command prompt) or as one
of the subexpressions in a braced list of expressions.
Let’s not put too fine a point on it: the R documentation is wrong. This is easy to show: we just need to find a counter-example of the = operator that isn’t (a) at the top level, nor (b) a subexpression in a braced list of expressions (i.e. {…; …}). — Without further ado:
x
# Error: object 'x' not found
sum((x = 1), 2)
# [1] 3
x
# [1] 1
Clearly we’ve performed an assignment, using =, outside of contexts (a) and (b). So, why has the documentation of a core R language feature been wrong for decades?
It’s because in R’s syntax the symbol = has two distinct meanings that get routinely conflated (even by experts, including in the documentation cited above):
The first meaning is as an assignment operator. This is all we’ve talked about so far.
The second meaning isn’t an operator but rather a syntax token that signals named argument passing in a function call. Unlike the = operator it performs no action at runtime, it merely changes the way an expression is parsed.
So how does R decide whether a given usage of = refers to the operator or to named argument passing? Let’s see.
In any piece of code of the general form …
‹function_name›(‹argname› = ‹value›, …)
‹function_name›(‹args›, ‹argname› = ‹value›, …)
… the = is the token that defines named argument passing: it is not the assignment operator. Furthermore, = is entirely forbidden in some syntactic contexts:
if (‹var› = ‹value›) …
while (‹var› = ‹value›) …
for (‹var› = ‹value› in ‹value2›) …
for (‹var1› in ‹var2› = ‹value›) …
Any of these will raise an error “unexpected '=' in ‹bla›”.
In any other context, = refers to the assignment operator call. In particular, merely putting parentheses around the subexpression makes any of the above (a) valid, and (b) an assignment. For instance, the following performs assignment:
median((x = 1 : 10))
But also:
if (! (nf = length(from))) return()
Now you might object that such code is atrocious (and you may be right). But I took this code from the base::file.copy function (replacing <- with =) — it’s a pervasive pattern in much of the core R codebase.
The original explanation by John Chambers, which the R documentation is probably based on, actually explains this correctly:
[= assignment is] allowed in only two places in the grammar: at the top level (as a complete program or user-typed expression); and when isolated from surrounding logical structure, by braces or an extra pair of parentheses.
In sum, by default the operators <- and = do the same thing. But either of them can be overridden separately to change its behaviour. By contrast, <- and -> (left-to-right assignment), though syntactically distinct, always call the same function. Overriding one also overrides the other. Knowing this is rarely practical but it can be used for some fun shenanigans.
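A rough sketch of that last point, assuming nothing beyond base R. The function demo and the replacement operator body are made up for illustration, and the override is kept inside a function so it does not leak into the workspace:

```r
demo <- function() {
  # Define a local replacement for `<-`, using `=` to do so.
  `<-` = function(lhs, rhs) "custom arrow"

  r1 = (a <- 1)   # calls the local replacement
  r2 = (2 -> b)   # parsed as `<-`(b, 2), so it calls the replacement too

  # The replacement never assigns, so `a` was not created here.
  list(left = r1, right = r2,
       a_created = exists("a", envir = environment(), inherits = FALSE))
}

out <- demo()
```

Here = keeps its normal behaviour throughout, while both arrow forms are routed to the replacement.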
Google's R style guide simplifies the issue by prohibiting the "=" for assignment. Not a bad choice.
https://google.github.io/styleguide/Rguide.xml
The R manual goes into nice detail on all 5 assignment operators.
http://stat.ethz.ch/R-manual/R-patched/library/base/html/assignOps.html
x = y = 5 is equivalent to x = (y = 5), because the assignment operators "group" right to left, which works. Meaning: assign 5 to y, leaving the number 5; and then assign that 5 to x.
This is not the same as (x = y) = 5, which doesn't work! Meaning: assign the value of y to x, leaving the value of y; and then assign 5 to, umm..., what exactly?
When you mix the different kinds of assignment operators, <- binds tighter than =. So x = y <- 5 is interpreted as x = (y <- 5), which is the case that makes sense.
Unfortunately, x <- y = 5 is interpreted as (x <- y) = 5, which is the case that doesn't work!
See ?Syntax and ?assignOps for the precedence (binding) and grouping rules.
According to John Chambers, the operator = is only allowed at "the top level," which means it is not allowed in control structures like if, making the following programming error illegal.
> if(x = 0) 1 else x
Error: syntax error
As he writes, "Disallowing the new assignment form [=] in control expressions avoids programming errors (such as the example above) that are more likely with the equal operator than with other S assignments."
You can manage to do this if it's "isolated from surrounding logical structure, by braces or an extra pair of parentheses," so if ((x = 0)) 1 else x would work.
See http://developer.r-project.org/equalAssign.html
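A quick way to see both behaviours side by side (a minimal sketch; bad and result are illustrative names):

```r
# A bare `=` in the condition is rejected by the parser.
bad <- try(parse(text = "if (x = 0) 1 else x"), silent = TRUE)
inherits(bad, "try-error")

# With an extra pair of parentheses it parses and performs assignment.
# x becomes 0, the condition is treated as FALSE, so the else branch
# returns x.
result <- if ((x = 0)) 1 else x
```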
From the official R documentation:
The operators <- and = assign into the environment in which they
are evaluated. The operator <- can be used anywhere, whereas the
operator = is only allowed at the top level (e.g., in the
complete expression typed at the command prompt) or as one of the
subexpressions in a braced list of expressions.
This may also add to understanding of the difference between those two operators:
df <- data.frame(
a = rnorm(10),
b <- rnorm(10)
)
For the first element R has assigned values and proper name, while the name of the second element looks a bit strange.
str(df)
# 'data.frame': 10 obs. of 2 variables:
# $ a : num 0.6393 1.125 -1.2514 0.0729 -1.3292 ...
# $ b....rnorm.10.: num 0.2485 0.0391 -1.6532 -0.3366 1.1951 ...
R version 3.3.2 (2016-10-31); macOS Sierra 10.12.1
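A smaller version of the same experiment, sketched with three rows and a fixed seed (both are my choices, not part of the original example), also shows the side effect of using <- inside the call:

```r
set.seed(1)
df <- data.frame(
  a = rnorm(3),   # named argument: becomes column "a"
  b <- rnorm(3)   # assignment: creates b in the workspace, column name is derived
)

names(df)              # first name is "a"; the second is built from the expression
exists("b")            # the assignment created b outside the data frame
identical(df[[2]], b)  # and the column holds the same values
```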
I am not sure whether Patrick Burns's book The R Inferno has been cited here. In section 8.2.26, "= is not a synonym of <-", Burns states: "You clearly do not want to use '<-' when you want to set an argument of a function." The book is available at https://www.burns-stat.com/documents/books/the-r-inferno/
There were some differences between <- and = in past versions of R, and in R's predecessor language, S. Currently, though, it seems that using only =, as in other modern languages (Python, Java), won't cause any problems. Using <- inside a function call lets you pass a value to an argument and create a global variable at the same time, but it can have weird or unwanted behavior, as in
df <- data.frame(
a = rnorm(10),
b <- rnorm(10)
)
str(df)
# 'data.frame': 10 obs. of 2 variables:
# $ a : num 0.6393 1.125 -1.2514 0.0729 -1.3292 ...
# $ b....rnorm.10.: num 0.2485 0.0391 -1.6532 -0.3366 1.1951 ...
Highly recommended: this article gives the best explanation I have found of the difference between the two operators:
https://colinfay.me/r-assignment/
Also, think about <- as a function that invisibly returns a value.
a <- 2
(a <- 2)
#> [1] 2
See: https://adv-r.hadley.nz/functions.html
I have the following mathematical formula that I want to program as efficiently as possible in R.
$\sum_{i=1}^{N}(x_i-\bar x)(y_i-\bar y)$
Let's say we have the following example data:
x = c(1,5,7,10,11)
y = c(2,4,8,9,12)
How can I easily get this sum with this data without making a separate function?
Isn't there a package or a function that can compute these mathematical sums?
Use the sum command and vectorized operations: sum((x-mean(x))*(y-mean(y)))
The key revelation here is that the sum function is just taking the sum over the argument (vector, matrix, whatever). In this case, it's sufficient to give it a vector, and in this case, the vector expression is a little more complicated than sum(z), but notice that (x-mean(x))*(y-mean(y)) evaluates to z, so the fact that the command is slightly ornate doesn't really change how the function works. This is true in many places, not just the sum command.
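If you would rather lean on a built-in, the same quantity is the sample covariance scaled by N - 1. A minimal sketch with the example data (direct and via_cov are illustrative names):

```r
x <- c(1, 5, 7, 10, 11)
y <- c(2, 4, 8, 9, 12)

# Vectorized sum of cross-products of deviations from the means.
direct <- sum((x - mean(x)) * (y - mean(y)))   # 62

# cov() divides the same sum by (N - 1), so multiply it back.
via_cov <- (length(x) - 1) * cov(x, y)

all.equal(direct, via_cov)
```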
I have a function f(v,u) and I defined function
solutionf(u) := fsolve(f(v,u)=v);
I need to plot solutionf(u) depending on u but just
plot(solutionf(u), u = 0 .. 0.4e-1)
gives me an error
Error, (in fsolve) number of equations, 1, does not match number of variables, 2
However I can always take the value solutionf(x) at any x.
Is there a simple way to plot this? Or do I have to write my own for loop over u, take the value at every point and plot the interpolated values?
This is one of the most-often-asked Maple questions. Your error is caused by what is known as premature evaluation, the expression solutionf(u) being evaluated before u has been given a numeric value.
There are several ways to avoid premature evaluation. The simplest is probably to use forward single quotes:
plot('solutionf(u)', u= 0..0.4e-1);
I'm confused about when a value is treated as a variable and when as a string in R. In Ruby and Python, I'm used to a string always having to be quoted, and an unquoted string is always treated as a variable, i.e.
a["hello"] => a["hello"]
b = "hi"
a[b] => a["hi"]
But in R, this is not the case, for example
a$b <- c(1,2,3)
b here is the value/name of the column, not the variable b.
c <- "b"
a$c => column not found (it's looking for column c, not b, which is the value of the variable c)
(I know that in this specific case I can use a[c], but there are many other cases. Such as ggplot(a, aes(x=c)) - I want to plot the column that is the value of c, not with the name c)...
In other StackOverflow questions, I've seen things like quote, substitute etc mentioned.
My question is: Is there a general way of "expanding" a variable and making sure the value of the variable is used, instead of the name of the variable? Or is that just not how things are done in R?
In your example, a$b is syntactic sugar for a[["b"]]. That's a special feature of the $ symbol when used with lists. The second form does what you expect - a[[b]] will return the element of a whose name == the value of the variable b, rather than the element whose name is "b".
Data frames are similar. For a data frame a, the $ operator refers to the column names. So a$b is the same as a[ , "b"]. In this case, to refer to the column of a indicated by the value of b, use a[, b].
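A short sketch of the difference, using a made-up data frame and a variable called col_name (both hypothetical):

```r
a <- data.frame(b = c(1, 2, 3), z = c(4, 5, 6))
col_name <- "b"

a$col_name     # NULL: looks for a column literally named "col_name"
a[[col_name]]  # uses the value of col_name, i.e. column "b"
a[, col_name]  # same column, via matrix-style indexing
a[["b"]]       # what a$b is shorthand for
```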
The reason that what you posted with respect to the $ operator doesn't work is quite subtle and is in general quite different to most other situations in R where you can just use a function like get which was designed for that purpose. However, calling a$b is equivalent to calling
`$`(a , b)
This reminds us, that in R, everything is an object. $ is a function and it takes two arguments. If we check the source code we can see that calling a$c and expecting R to evaluate c to "b" will never work, because in the source code it states:
/* The $ subset operator.
We need to be sure to only evaluate the first argument.
The second will be a symbol that needs to be matched, not evaluated.
*/
It achieves this using the following:
if(isSymbol(nlist) )
SET_STRING_ELT(input, 0, PRINTNAME(nlist));
else if(isString(nlist) )
SET_STRING_ELT(input, 0, STRING_ELT(nlist, 0));
else {
errorcall(call,_("invalid subscript type '%s'"),
type2char(TYPEOF(nlist)));
}
nlist is the argument you passed to do_subset_3 (the name of the C function $ maps to), in this case c. It found that c was a symbol, so it replaces it with a string but does not evaluate it. If it was a string then it is passed as a string.
Here are some links to help you understand the 'why's and 'when's of evaluation in R. They may be enlightening, they may even help, if nothing else they will let you know that you are not alone:
http://developer.r-project.org/nonstandard-eval.pdf
http://journal.r-project.org/2009-1/RJournal_2009-1_Chambers.pdf
http://www.burns-stat.com/documents/presentations/inferno-ish-r/
In that last one, the most important piece is bullet point 2, then read through the whole set of slides. I would probably start with the 3rd one, then the 1st 2.
These are less in the spirit of how to make a specific case work (as the other answers have done) and more in the spirit of what has led to this state of affairs and why in some cases it makes sense to have standard nonstandard ways of accessing variables. Hopefully understanding the why and when will help with the overall what to do.
If you want to get the variable named "b", use the get function in every case. This will substitute the value of b for get(b) wherever it is found.
If you want to play around with expressions, you need to use quote(), substitute(), bquote(), and friends like you mentioned.
For example:
x <- quote(list(a = 1))
names(x) # [1] "" "a"
names(x) <- c("", "foo")
x # list(foo = 1)
And:
c <- "foo"
bquote(ggplot(a, aes(x=.(c)))) # ggplot(a, aes(x = "foo"))
substitute(ggplot(a, aes(x=c)), list(c = "foo"))
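The calls built above are not evaluated until you ask for it. A base-R-only sketch of the same idea (no ggplot2 needed; the data frame a and the names c_name, call_b, call_s are invented for illustration):

```r
a <- data.frame(foo = 1:3)
c_name <- "foo"

# Both build the unevaluated call a[["foo"]] from the string in c_name.
call_b <- bquote(a[[.(c_name)]])
call_s <- substitute(a[[nm]], list(nm = c_name))

# eval() then runs the constructed call.
eval(call_b)
eval(call_s)

# get() looks up a variable by the name stored in a string.
get("c_name")
```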
I often write R code where I test the length of a vector, the number of rows in a data frame, or the dimensions of a matrix, for example if (length(myVector) == 1). While poking around in some base R code, I noticed that in such comparisons values are explicitly stated as integers, usually using the 'L' suffix, for example if (nrow(data.frame) == 5L). Explicit integers are also sometimes used for function arguments, for example these statements from the cor function: x <- matrix(x, ncol = 1L) and apply(u, 2L, rank, na.last = "keep"). When should integers be explicitly specified in R? Are there any potentially negative consequences from not specifying integers?
You asked:
Are there any potentially negative consequences from not specifying
integers?
There are situations where it is likely to matter more. From Chambers Software for Data Analysis p193:
Integer values will be represented exactly as "double" numbers so long
as the absolute value of the integer is less than 2^m, the length of
the fractional part of the representation (2^54 for 32-bit machines).
It's not hard to see how if you calculated a value it might look like an integer but not quite be one:
> (seq(-.45,.45,.15)*100)[3]
[1] -15
> (seq(-.45,.45,.15)*100)[3] == -15L
[1] FALSE
However, it's harder to come up with an example of explicitly typing in an integer and having it come up not quite an integer in the floating point representation, until you get into the larger values Chambers describes.
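A few checks that make the distinction concrete (a minimal sketch; val is an illustrative name, and the tolerance is all.equal()'s default):

```r
typeof(1)    # "double"
typeof(1L)   # "integer"

# length() returns an integer, so only the L-suffixed literal is identical.
identical(length(1:3), 3L)  # TRUE
identical(length(1:3), 3)   # FALSE: same number, different type
length(1:3) == 3            # TRUE: == coerces before comparing

# For computed doubles, compare with a tolerance instead of ==.
val <- (seq(-.45, .45, .15) * 100)[3]
isTRUE(all.equal(val, -15))
```

So for literals typed by hand, == against 3 and 3L behaves the same; the L suffix mainly matters for identical() and for being explicit about intent.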
Using 1L etc is programmatically safe, as in it is explicit as to what is meant, and does not rely on any conversions etc.
When writing code interactively, it can be easy to notice errors and fix along the way, however if you are writing a package (even base R), it will be safer to be explicit.
When you are considering equality, using floating point numbers will cause precision issues. See this FAQ.
Explicitly specifying integers avoids this, as nrow and length, and the index arguments to apply return or require integers.