Going through the metaprogramming sections of Hadley's book Advanced R (2nd ed.), I am having quite a tough time understanding the concept. I have been programming with R for a while, but this is the first time I have come across metaprogramming. This exercise question in particular confuses me:
"The following two calls print the same, but are actually different:
(a <- expr(mean(1:10)))
#> mean(1:10)
(b <- expr(mean(!!(1:10))))
#> mean(1:10)
identical(a, b)
#> [1] FALSE
What’s the difference? Which one is more natural?"
When I eval() them, they both return the same value:
> eval(a)
[1] 5.5
> eval(b)
[1] 5.5
When I look inside the a and b objects, the second one does print differently, but I am not sure what this means in terms of their difference:
> a[[2]]
1:10
> b[[2]]
[1] 1 2 3 4 5 6 7 8 9 10
Also, if I just run them without eval(expr(...)), they return different results:
mean(1:10)
[1] 5.5
mean(!!(1:10))
[1] 1
My guess is that without expr(...), !!(1:10) acts as a double negation, which with coercion essentially forces all the numbers to 1, hence a mean of 1.
My questions are:
Why does !! act differently with and without expr(...)? I would expect eval(expr(mean(!!(1:10)))) to return the same as mean(!!(1:10)), but this is not so.
I still do not fully grasp the difference between the a object and the b object.
Thank you in advance.
!! here is used not as double negation, but as the unquote operator from rlang.
Unquoting is one inverse of quoting. It allows you to selectively evaluate code inside expr(), so that expr(!!x) is equivalent to x.
The difference between a and b is that the argument remains as an unevaluated call in a, while it is evaluated in b:
class(a[[2]])
[1] "call"
class(b[[2]])
[1] "integer"
The a behaviour may be an advantage in some circumstances because it delays evaluation, or a disadvantage for the same reason. When it is a disadvantage, it is the cause of much frustration. If the argument were a larger vector, the size of b would increase, while a would stay the same.
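A minimal sketch of that size difference, using lobstr::obj_size() (exact sizes vary by platform):
library(rlang)
library(lobstr)
big <- expr(mean(1:1e6))     # stores only the call 1:1e6
inl <- expr(mean(!!(1:1e6))) # embeds the evaluated integer vector
obj_size(big)                # small: just the unevaluated call
obj_size(inl)                # roughly 4 MB: the whole vector is inlined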
See section 19.4 of Advanced R for more details.
Here is the difference. When we negate (!) an integer vector, numbers other than 0 are converted to FALSE and 0 to TRUE. With another negation, i.e. double (!!), the FALSE values are changed to TRUE and vice versa:
!0:5
#[1] TRUE FALSE FALSE FALSE FALSE FALSE
!!0:5
#[1] FALSE TRUE TRUE TRUE TRUE TRUE
With the OP's example, they are all TRUE:
!!1:10
#[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
and TRUE/FALSE coerce to 1/0:
as.integer(!!1:10)
#[1] 1 1 1 1 1 1 1 1 1 1
thus the mean is 1:
mean(!!1:10)
#[1] 1
Regarding 'a' vs. 'b':
str(a)
#language mean(1:10)
str(b)
#language mean(1:10)
Both are language objects, and both evaluate to the mean of the numbers 1:10:
all.equal(a, b)
#[1] TRUE
If we need to get the mean of the 10 numbers, the first one is the correct way.
We could evaluate the second option as a plain double negation, i.e. get a mean value of 1, by quoting with base quote(), which performs no unquoting:
eval(quote(mean(!!(1:10))))
#[1] 1
eval(quote(mean(1:10)))
#[1] 5.5
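The contrast is easy to see side by side: base quote() performs no unquoting, while rlang's expr() does.
library(rlang)
quote(mean(!!(1:10))) # quote() keeps the double negation in the call
expr(mean(!!(1:10)))  # expr() unquotes first, leaving mean(1:10)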
!! has special meaning when used inside expr(). Outside expr() you get different results because !! is a double negation.
Even inside expr() the two versions are different: 1:10 is an expression that yields an integer vector when evaluated, while !!(1:10) is the result of evaluating that same expression.
An expression and its result after evaluation are different things.
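A minimal illustration of that last point:
e <- quote(1:10) # an unevaluated expression (a call)
v <- eval(e)     # its result: an integer vector
identical(e, v)
#[1] FALSE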
a <- character()
b <- "SO is great"
any(a == b)
#> [1] FALSE
all(a == b)
#> [1] TRUE
The manual describes ‘any’ like this
Given a set of logical vectors, is at least one of the values true?
So, any(a == b) returning FALSE means not even one value in the comparison a == b yields TRUE.
If that is the case, how can ‘any’ return FALSE while ‘all’ returns TRUE? ‘all’ is described as "Given a set of logical vectors, are all of the values true?".
In a nutshell: all values are TRUE and none are TRUE at the same time?
I am no expert, but that looks odd.
Questions:
Is there a reasonable explanation for this, or is it just some quirk of R?
What are the ways around this?
Created on 2021-01-08 by the reprex package (v0.3.0)
Usually, when comparing a == b the elements of the shorter vector are recycled as necessary. However, in your case a has no elements, so no recycling occurs and the result is an empty logical vector.
The results of any(a == b) and all(a == b) are consistent with the logical quantifiers for all and exists. If you quantify over an empty range, for all yields the neutral element of logical conjunction (AND), which is TRUE, while exists yields the neutral element of logical disjunction (OR), which is FALSE.
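Both neutral elements can be checked directly on the empty comparison:
a <- character()
b <- "SO is great"
a == b          # logical(0): an empty logical vector
any(logical(0)) # FALSE: no element exists that is TRUE
all(logical(0)) # TRUE: vacuously true, no element is FALSE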
As for how to avoid these situations, check that the vectors have the same length, since comparing vectors of different lengths rarely makes sense.
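A sketch of such a length guard (same_length_eq is a hypothetical helper, not an existing function):
same_length_eq <- function(a, b) {
  # refuse to call two vectors equal unless lengths match and every element agrees
  length(a) == length(b) && isTRUE(all(a == b))
}
same_length_eq(character(), "SO is great")   # FALSE
same_length_eq("SO is great", "SO is great") # TRUE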
Regarding question number 2, I know of identical. It works well in all the situations I can think of.
a <- "a"
b <- "b"
identical(a, b) # FALSE >> works
#> [1] FALSE
a <- character(0)
identical(a, b) # FALSE >> works
#> [1] FALSE
a <- NA
identical(a, b) # FALSE >> works
#> [1] FALSE
a <- NULL
identical(a, b) # FALSE >> works
#> [1] FALSE
a <- b
identical(a, b) # TRUE >> works
#> [1] TRUE
identical seems to be a good workaround, though it still feels like a workaround to a part-time developer like me. Are there more solutions? Better ones? And why does R behave like this in the first place (see question 1)?
Created on 2021-01-08 by the reprex package (v0.3.0)
Regarding question 1)
I have no idea whether I am correct, but here are my thoughts:
In R, all() is the complement of any(). For consistency, all(logical(0)) is TRUE, so in your situation you are hitting exactly this edge case.
In mathematics, this is analogous to a set being both open and closed. I'm not a computer scientist, so I can't really speak to why one of the greybeards from way back when implemented it this way in R or S.
Regarding question 2)
I think the other responses have answered this well.
Another solution, provided by the shiny package, is isTruthy(). The package introduced the concept of truthy/falsy, which "generally indicates whether a value, when coerced to a base::logical(), is TRUE or FALSE" (see the documentation).
require(shiny, quietly = TRUE)
a <- "a"
b <- "b"
isTruthy(a == b) # FALSE >> works
#> [1] FALSE
a <- character(0)
isTruthy(a == b) # FALSE >> works
#> [1] FALSE
a <- NA
isTruthy(a == b) # FALSE >> works
#> [1] FALSE
a <- NULL
isTruthy(a == b) # FALSE >> works
#> [1] FALSE
a <- b
isTruthy(a == b) # TRUE >> works
#> [1] TRUE
One of the advantages is that you can use other operators like %in% or match(), too.
The situation in R is that you never know what a function will return when it fails. Some functions return NA, others NULL, and yet others vectors of length 0. isTruthy() makes it easier to handle this diversity.
Unfortunately, when one does not write a shiny app, it hardly makes sense to load the package, because aside from isTruthy() shiny only adds a large bunch of unneeded web-app features.
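If all you need is a scalar comparison, base R's isTRUE() covers the same edge cases without the dependency; a sketch:
isTRUE(character(0) == "b") # FALSE: comparison is logical(0)
isTRUE(NA == "b")           # FALSE: NA is not TRUE
isTRUE(NULL == "b")         # FALSE: comparison is logical(0)
isTRUE("b" == "b")          # TRUE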
Created on 2021-01-10 by the reprex package (v0.3.0)
Please see the example below. The logical operator < works with character and numeric operands, and in some cases it returns TRUE. I'm confused because in my view it should return NA or at least FALSE. Is this by design in R?
And I'd be grateful if you could show me a simple alternative method. (I can solve this problem with a custom function that checks the class before the logical comparison. Are there better solutions?)
"(abc" < 0 # TRUE
"(abc" < -1 # FALSE
"abc" < 9999999999 # FALSE
"abc" < Inf # TRUE
Most likely R is performing a character conversion of the RHS of your inequalities. That is, R is actually making the following comparisons:
"(abc" < "0"
"(abc" < "-1"
"abc" < "999999999"
"abc" < Inf
[1] TRUE
[1] FALSE
[1] FALSE
[1] TRUE
Note that the outputs agree with your current output, which uses number literals on the RHS.
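As for the alternative method you asked about, one option is a small guard that refuses mixed types before comparing (num_lt is a hypothetical helper, only a sketch):
num_lt <- function(x, y) {
  stopifnot(is.numeric(x), is.numeric(y)) # error out instead of silently coercing
  x < y
}
num_lt(-1, 0)       # TRUE
# num_lt("(abc", 0) # stops with an error rather than returning TRUE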
(reproducible example added)
I cannot quite grasp why the following is FALSE (I am aware they are double and integer, respectively):
identical(1, as.integer(1)) # FALSE
?identical reveals:
num.eq:
logical indicating if (double and complex non-NA) numbers should be compared using == (‘equal’), or by bitwise comparison. The latter (non-default)
differentiates between -0 and +0.
sprintf("%.8190f", as.integer(1)) and sprintf("%.8190f", 1) return exactly equal bit patterns. So I think that at least one of the following must return TRUE, but I get FALSE in each case:
identical(1, as.integer(1), num.eq=TRUE) # FALSE
identical(1, as.integer(1), num.eq=FALSE) # FALSE
My current thinking is: if sprintf indicates the notation, not the storage, then identical() compares based on storage, i.e.
identical(bitpattern1, bitpattern2) returns FALSE. I could not find any other logical explanation for the FALSE/FALSE situation above.
I do know that in both the 32-bit and 64-bit builds of R, integers are stored as 32 bits.
They are not identical precisely because they have different types. If you look at the documentation for identical you'll find the example identical(1, as.integer(1)) with the comment ## FALSE, stored as different types. That's one clue. The R language definition reminds us that:
Single numbers, such as 4.2, and strings, such as "four point two" are still vectors, of length 1; there are no more basic types (emphasis mine).
So, basically everything is a vector with a type (that's also why [1] shows up every time R returns something). You can check this by explicitly creating a length-1 vector with vector() and then comparing it to 0:
x <- vector("double", 1)
identical(x, 0)
# [1] TRUE
That is to say, both vector("double", 1) and 0 output vectors of type "double" and length == 1.
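The type mismatch from the question is visible directly with typeof():
typeof(1)             # "double"
typeof(as.integer(1)) # "integer"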
typeof and storage.mode point to the same thing, so you're kind of right when you say "this means identical() compares based on storage". I don't think this necessarily means that "bit patterns" are being compared, although I suppose it's possible. See what happens when you change the storage mode using storage.mode:
## Assign integer to x. This is really a vector length == 1.
x <- 1L
typeof(x)
# [1] "integer"
identical(x, 1L)
# [1] TRUE
## Now change the storage mode and compare again.
storage.mode(x) <- "double"
typeof(x)
# [1] "double"
identical(x, 1L) # This is no longer TRUE.
# [1] FALSE
identical(x, 1.0) # But this is.
# [1] TRUE
One last note: The documentation for identical states that num.eq is a…
logical indicating if (double and complex non-NA) numbers should be compared using == (‘equal’), or by bitwise comparison.
So, changing num.eq doesn't affect any comparison involving integers. Try the following:
# Comparing integers with integers.
identical(+0L, -0L, num.eq = T) # TRUE
identical(+0L, -0L, num.eq = F) # TRUE
# Comparing integers with doubles.
identical(+0, -0L, num.eq = T) # FALSE
identical(+0, -0L, num.eq = F) # FALSE
# Comparing doubles with doubles.
identical(+0.0, -0.0, num.eq = T) # TRUE
identical(+0.0, -0.0, num.eq = F) # FALSE
Given the string patt:
patt = "AGCTTCATGAAGCTGAGTNGGACGCGATGATGCG"
We can make a collection of shorter substrings str_col:
str_col = substring(patt,1:(nchar(patt)-9),10:nchar(patt))
which we want to match against a subject1:
subject1 = "AGCTTCATGAAGCTGAGTGGGACGCGATGATGCGACTAGGGACCTTAGCAGC"
treating "N" in patt as a wildcard (match to any letter in subject1), so all substrings in str_col match to subject1.
I want to do this kind of string matching in a large database of strings, and I found the Bioconductor package Biostrings very efficient for that. But in order to be efficient, Biostrings requires you to convert your collection of substrings (here str_col) into a dictionary of class PDict using the function PDict(). You can use this 'dictionary' later in functions like countPDict() to count matches against a target.
In order to use wildcards, you have to divide your dictionary into 3 parts: a head (left), a trusted band (middle) and a tail (right). You can only have wildcards, like "N", in the head or tail, not in the trusted band, and you cannot have a trusted band of width = 0. So, for example, str_col[15] fails if you use a trusted band of minimum width = 1, like:
> PDict(str_col[1:15],tb.start=5,tb.end=5)
Error in .Call2("ACtree2_build", tb, pp_exclude, base_codes, nodebuf_ptr, :
non base DNA letter found in Trusted Band for pattern 15
because the "N" is right in the trusted band. Notice that the strings here are DNA sequences, so "N" is a code for "match to A, C, G, or T".
> PDict(str_col[1:14],tb.start=5,tb.end=5) #is OK
TB_PDict object of length 14 and width 10 (preprocessing algo="ACtree2"):
- with a head of width 4
- with a Trusted Band of width 1
- with a tail of width 5
Is there any way to circumvent this limitation of Biostrings? I also tried to perform this task using base R functions, but I couldn't come up with anything.
I reckon you'll need to match against some other wildcards from the IUPAC ambiguity code at some point, no?
If you need perfect matches and base functions are enough for you, you can use the same trick as the function glob2rx(): simply use gsub() to convert the ambiguity codes into matching patterns. An example:
IUPACtoRX <- function(x){
  p <- gsub("N", "[ATCG]", x) # N: match any base
  p <- gsub("Y", "[CT]", p)   # Y: match any pyrimidine
  # add the other ambiguity codes you want here
  p
}
Obviously you need a line for every ambiguity code you want to support, but it's pretty straightforward I'd say.
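For example, one of the substrings containing "N" converts like this:
IUPACtoRX("AAGCTGAGTN")
# [1] "AAGCTGAGT[ATCG]"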
Doing this, you can then, e.g., do something like:
> sapply(str_col, function(i) grepl(IUPACtoRX(i),subject1) )
AGCTTCATGA GCTTCATGAA CTTCATGAAG TTCATGAAGC TCATGAAGCT CATGAAGCTG ATGAAGCTGA
TRUE TRUE TRUE TRUE TRUE TRUE TRUE
TGAAGCTGAG GAAGCTGAGT AAGCTGAGTN AGCTGAGTNG GCTGAGTNGG CTGAGTNGGA TGAGTNGGAC
TRUE TRUE TRUE TRUE TRUE TRUE TRUE
GAGTNGGACG AGTNGGACGC GTNGGACGCG TNGGACGCGA NGGACGCGAT GGACGCGATG GACGCGATGA
TRUE TRUE TRUE TRUE TRUE TRUE TRUE
ACGCGATGAT CGCGATGATG GCGATGATGC CGATGATGCG
TRUE TRUE TRUE TRUE
To find the number of matches, you can use e.g. gregexpr():
> sapply(str_col, function(i) sum(gregexpr(IUPACtoRX(i), subject1)[[1]] > 0))
AGCTTCATGA GCTTCATGAA CTTCATGAAG TTCATGAAGC TCATGAAGCT CATGAAGCTG ATGAAGCTGA
1 1 1 1 1 1 1
TGAAGCTGAG GAAGCTGAGT AAGCTGAGTN AGCTGAGTNG GCTGAGTNGG CTGAGTNGGA TGAGTNGGAC
1 1 1 1 1 1 1
GAGTNGGACG AGTNGGACGC GTNGGACGCG TNGGACGCGA NGGACGCGAT GGACGCGATG GACGCGATGA
1 1 1 1 1 1 1
ACGCGATGAT CGCGATGATG GCGATGATGC CGATGATGCG
1 1 1 1
I am dealing with the roots of a second-order polynomial and I only want to store the complex roots (the ones with a nonzero imaginary part). When I do:
Im(roots)
[1] -1.009742e-28 1.009742e-28
the program says the imaginary part is not equal to 0, and so the condition
Im(roots) == 0
is never true, so I also end up storing the roots that are purely real.
Thanks!
This is probably a case of FAQ 7.31 (dealing with the representation and comparison of floating-point numbers). The all.equal function is the right tool in such cases. Best use would be:
> isTRUE(all.equal(roots[1], 0) )
[1] TRUE
> isTRUE(all.equal(roots[2], 0) )
[1] TRUE
Read ?all.equal for all the gory details.
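For the storing step, a sketch that keeps only the genuinely complex roots via an explicit tolerance (the 1e-8 cutoff is an assumption; tune it to your problem's scale):
tol <- 1e-8 # assumed tolerance
complex_roots <- roots[abs(Im(roots)) > tol]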
DWin is almost certainly right that you're getting numbers with magnitudes that small due to the imprecision of floating point arithmetic.
To correct for it in your application, you might want to use zapsmall(x, digits). zapsmall() is a nice utility function that rounds to 0 any numbers that are very close to it (within digits decimal places).
Here, riffing off an example from its help page:
thetas <- 0:4*pi/2
coords <- exp(1i*thetas)
coords
# [1] 1+0i 0+1i -1+0i 0-1i 1-0i
## Floating point errors obscure the big picture
Im(coords) == 0
# [1] TRUE FALSE FALSE FALSE FALSE
Re(coords) == 0
# [1] FALSE FALSE FALSE FALSE FALSE
## zapsmall makes it all better
Im(zapsmall(coords)) == 0
# [1] TRUE FALSE TRUE FALSE TRUE
Re(zapsmall(coords)) == 0
# [1] FALSE TRUE FALSE TRUE FALSE