I often write R code where I test the length of a vector, the number of rows in a data frame, or the dimensions of a matrix, for example if (length(myVector) == 1). While poking around in some base R code, I noticed that in such comparisons values are explicitly stated as integers, usually using the 'L' suffix, for example if (nrow(data.frame) == 5L). Explicit integers are also sometimes used for function arguments, for example these statements from the cor function: x <- matrix(x, ncol = 1L) and apply(u, 2L, rank, na.last = "keep"). When should integers be explicitly specified in R? Are there any potentially negative consequences from not specifying integers?
You asked:
Are there any potentially negative consequences from not specifying
integers?
There are situations where it is likely to matter more. From Chambers, Software for Data Analysis, p. 193:
Integer values will be represented exactly as "double" numbers so long
as the absolute value of the integer is less than 2^m, the length of
the fractional part of the representation (2^54 for 32-bit machines).
It's not hard to see how a calculated value might look like an integer without quite being one:
> (seq(-.45,.45,.15)*100)[3]
[1] -15
> (seq(-.45,.45,.15)*100)[3] == -15L
[1] FALSE
However, it's harder to come up with an example of explicitly typing in an integer and having it come up not quite an integer in the floating point representation, until you get into the larger values Chambers describes.
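For illustration, here is my own small sketch of the large-value regime Chambers describes (not an example from the book):
2^53 == 2^53 + 1        # TRUE: 2^53 + 1 cannot be represented exactly as a double
.Machine$integer.max    # 2147483647, the largest integer the L suffix can hold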
Using 1L etc. is programmatically safe: it is explicit about what is meant and does not rely on implicit conversion.
When writing code interactively it is easy to notice and fix errors along the way; however, if you are writing a package (even base R), it is safer to be explicit.
When you are testing equality, floating point numbers will cause precision issues; see this FAQ.
Explicitly specifying integers avoids this, since nrow() and length() return integers and the index argument to apply() expects one.
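A short illustration of the precision issue the FAQ describes (my own sketch):
0.1 + 0.2 == 0.3                     # FALSE: neither side is stored exactly as a double
isTRUE(all.equal(0.1 + 0.2, 0.3))    # TRUE: comparison with a tolerance
length(letters) == 26L               # TRUE: length() returns an integer, so this is exact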
Related: I found this previous thread here: Range standardization (0 to 1) in R.
This leads me to my question: when building a function to perform a calculation across all values in a vector, my understanding was that this is exactly the scenario where a for-loop is necessary (because the calculation is applied to every value in the vector). However, apparently that is not the case. What am I misunderstanding?
All of the basic arithmetic operations in R, and most of the basic numerical/mathematical functions, are natively vectorized; that is, they operate position-by-position on elements of vectors (or pairs of elements from matched vectors). So for example if you compute
c(1,3,5) + c(2,4,7)
you don't need an explicit for loop (although there is one in the underlying C code!) to get c(3, 7, 12) as the answer.
In addition, R has vector recycling rules; any time you call an operation with two vectors, the shorter automatically gets recycled to match the length of the longer one. R doesn't have scalar types, so a single number is stored as a numeric vector of length 1. So if we compute
(x-min(x))/(max(x)-min(x))
max(x) and min(x) are both of length 1, so the denominator is also of length 1. Subtracting min(x) from x, min(x) gets replicated/recycled to a vector the same length as x, and then the pairwise subtraction is done. Finally, the numerator (length == length(x)) is divided by the denominator (length 1) following similar rules.
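Here is a small sketch of the recycling at work (the data are made up for illustration):
x <- c(2, 4, 6, 10)
x - min(x)                           # 0 2 4 8: the length-1 result of min(x) is recycled
(x - min(x)) / (max(x) - min(x))     # 0.00 0.25 0.50 1.00, the 0-1 standardization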
In general, exploiting this built-in vectorization in R is much faster than writing your own for-loop, so it's definitely worth trying to get the hang of when operations can be vectorized.
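As a rough comparison, here is my own sketch of timing a hand-written loop against the vectorized form (exact timings will vary by machine):
x <- runif(1e6)
lo <- min(x); hi <- max(x)
system.time({                                # explicit for-loop, element by element
  out1 <- numeric(length(x))
  for (i in seq_along(x)) out1[i] <- (x[i] - lo) / (hi - lo)
})
system.time(out2 <- (x - lo) / (hi - lo))    # vectorized: typically far faster
all.equal(out1, out2)                        # TRUE: same result either way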
This kind of vectorization is implemented in many higher-level languages/libraries that are specialized for numerical computation (pandas and numpy in Python, Matlab, Julia, ...).
I would like to create "matrix-like" objects that are not necessarily "proper" matrices. But what, precisely, does "matrix-like" mean?
Example 1
> image(1:9)
Error in image.default(1:9) : argument must be matrix-like
Example 2
In the R Language Definition (v3.3.1, §3.4.3) it is a hapax legomenon (emphasis added):
[An] example of a class method for [ is… if two indices are supplied (even if one is empty) it creates matrix-like indexing…
Example 3
The title of help(scale) reads, "Scaling and Centering of Matrix-like Objects" (emphasis added). There seems to be a clue there:
numeric-alike means that as.numeric(.) will be applied successfully if is.numeric(.) is not true.
Matrix-like data is data in tabular form with the dim attribute set; length(dim(obj)) must equal 2, since matrices are two-dimensional objects.
Quoting from Advanced R by Hadley Wickham:
Matrices and arrays
Adding a dim attribute to an atomic vector allows it to behave like a
multi-dimensional array. A special case of the array is the matrix,
which has two dimensions. Matrices are used commonly as part of the
mathematical machinery of statistics. Arrays are much rarer, but worth
being aware of.
Matrices and arrays are created with matrix() and array(), or by using
the assignment form of dim()
See also the help("dim") page.
Example:
x <- 1:9
image(x)           # error: argument must be matrix-like
y <- 1:9
dim(y) <- c(3, 3)  # y now carries a dim attribute of length 2
image(y)           # works: y is a 3 x 3 matrix
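As a rough illustration (my own sketch, not from the help pages): dim() also works on a data frame and returns a length-2 result, and a data frame is accepted where matrix-like input is expected, for example by scale():
df <- data.frame(a = 1:3, b = c(2, 4, 6))
dim(df)          # 3 2, so length(dim(df)) == 2
is.matrix(df)    # FALSE, yet df is matrix-like
scale(df)        # works: centers and scales each column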
I am aware of at least two distinct uses of the equals sign in the R-programming language:
(1) as a discouraged assignment operator, i.e. x = 3 instead of x <- 3.
(2) for passing values of arguments to functions, e.g. ggplot(df, aes(x = length, y = width))
Do either of these operators correspond to symmetric relations (in the sense of mathematics)?
The 'equals' operator == does (I think), which is why it corresponds most closely to the use of the equals sign in mathematics (which is always a symmetric relation).
But, for example, if one tried to run ggplot(df, aes(length = x, width = y)) one would get an error, and one would also get an error trying to run 3 = x.
Thus, is it true that, unlike in mathematics, the equals sign in R is not a symmetric relation? Is that why <- is preferred by some for assignment, because it better conveys the lack of symmetry?
Bonus question: are there other programming languages where the equals sign does not correspond to a symmetric relation? PowerShell (which I had never heard of before) might be one.
The = operator is not symmetric in R. When it comes to assignment, = is basically a function that takes a symbol and a value and assigns that value to that symbol. When it comes to named parameters, it's really just part of the syntax of naming a parameter.
<- is preferred for assignment simply because it has an unambiguous meaning.
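A minimal sketch of the asymmetry (my own illustration, not part of the original answer):
x = 3        # fine: assigns the value 3 to the symbol x
# 3 = x      # error: invalid left-hand side for assignment
x == 3       # TRUE
3 == x       # TRUE: ==, by contrast, is symmetric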
I have the following mathematical formula that I want to program as efficiently as possible in R.
$\sum_{i=1}^{N}(x_i-\bar x)(y_i-\bar y)$
Let's say we have the following example data:
x = c(1,5,7,10,11)
y = c(2,4,8,9,12)
How can I easily get this sum with this data without making a separate function?
Isn't there a package or a function that can compute these mathematical sums?
Use the sum command and vectorized operations: sum((x-mean(x))*(y-mean(y)))
The key insight here is that the sum function simply takes the sum over its argument (vector, matrix, whatever). In this case it's sufficient to give it a vector; the expression is a little more ornate than a plain sum(z), but (x-mean(x))*(y-mean(y)) just evaluates to a vector, so the slightly ornate form doesn't change how the function works. The same is true in many places, not just the sum command.
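For instance, with the example data from the question (a quick check of my own):
x <- c(1, 5, 7, 10, 11)
y <- c(2, 4, 8, 9, 12)
sum((x - mean(x)) * (y - mean(y)))   # 62
cov(x, y) * (length(x) - 1)          # also 62: the same sum via the sample covariance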
I'm maintaining some code where a previous author has used statements like:
x <- 1. * a + b
or:
if (y < 1.e-3)
or:
z[z < 0.] <- 0
or:
f <- aa + bb / 2.
The dots in these statements aren't appearing within parameter or function names, and they're not appearing within formulas, so I'm having trouble figuring out whether they have any significance. As far as I can tell, similar statements evaluated with constants substituted for the variables don't evaluate any differently.
I thought that perhaps the periods were inserted to coerce the result to a float, but the statements don't seem to be ambiguous in that respect, and I wasn't under the impression that R needed any help with different numeric types. The only other explanation I can come up with is that the values were originally floats and the previous author got lazy about removing the decimal points when they were changed to integers.
Is there any other possible use for the dot that could be relevant, or am I safe cleaning up these statements?
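As a quick check of the claim that the trailing dots make no difference (my own sketch, not part of the question):
typeof(1)                # "double": plain numeric literals are already doubles
typeof(1.)               # "double"
identical(1., 1)         # TRUE
identical(1.e-3, 0.001)  # TRUE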