Today I found that some of my stopifnot() checks pass silently, when they should fail, because the passed arguments evaluate to empty logical vectors.
Here is an example:
stopifnot(iris$nosuchcolumn == 2) # passes without error
This is very unintuitive and seems to contradict a few other behaviours. Consider:
isTRUE(logical())
> FALSE
stopifnot(logical())
# passes
So stopifnot() passes even when this argument is not TRUE.
But furthermore, the behaviour of the above is different with different types of empty vectors.
isTRUE(numeric())
> FALSE
stopifnot(numeric())
# Error: numeric() are not all TRUE
Is there some logic to the above, or should this be considered a bug?
The comments by akrun and r2evans are spot on.
However, to explain why specifically this happens, and why the behaviour differs from isTRUE(), note that stopifnot() checks three things. The check is (where r is the result of the expression you pass):
if (!(is.logical(r) && !anyNA(r) && all(r)))
So, let's take a look:
is.logical(logical())
# [1] TRUE
!anyNA(logical())
# [1] TRUE
all(logical())
# [1] TRUE
is.logical(numeric())
# [1] FALSE
!anyNA(numeric())
# [1] TRUE
all(numeric())
# [1] TRUE
So, the only reason why logical() passes while numeric() fails is because numeric() is not "logical," as suggested by akrun. For this reason, you should avoid checks that may result in logical vectors of length 0, as suggested by r2evans.
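If a zero-length result should count as a failure, one option is to test the length explicitly. The following is a minimal sketch under that assumption; the helper name stop_if_not_nonempty is hypothetical and not part of base R:

```r
# Hypothetical helper: like stopifnot(), but also rejects zero-length results.
stop_if_not_nonempty <- function(cond) {
  stopifnot(is.logical(cond), length(cond) > 0L, !anyNA(cond), all(cond))
  invisible(TRUE)
}

# iris$nosuchcolumn is NULL, so the comparison yields logical(0).
# The plain stopifnot() would pass here; the stricter helper raises an error.
result_empty <- tryCatch(
  stop_if_not_nonempty(iris$nosuchcolumn == 2),
  error = function(e) "caught"
)

# A non-empty, all-TRUE condition still passes.
result_ok <- stop_if_not_nonempty(c(TRUE, TRUE))
```

Here result_empty ends up as "caught", while the all-TRUE call returns TRUE invisibly.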
Other answers cover the practical reasons why stopifnot behaves the way it does; but I agree with Karolis that the thread linked by Henrik adds the real explanation of why this is the case:
As author stopifnot(), I do agree with [OP]'s "gut feeling" [...] that
stopifnot(dim(x) == c(3,4)) [...][should] stop in the case
where x is a simple vector instead of a matrix/data.frame/... with
dimensions c(3,4) ... but [...] the gut feeling is wrong because of the fundamental lemma of logic: [...]
"All statements about elements of the empty set are true"
Martin Maechler, ETH Zurich
Also, [...], any() is to "|" what sum() is to "+" and what all() is to
"&" and prod() is to "*". All the operators have an identity element,
namely FALSE, 0, TRUE, and 1 respectively, and the generic convention
is that for an empty vector, we return the identity element, for the
reason given above.
Peter D.
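The identity-element convention described in the quote can be checked directly in the console. This is just an illustration of the behaviour for empty vectors, not anything taken from the linked thread:

```r
# Each reduction returns the identity element of its operator on empty input.
res_all  <- all(logical())   # TRUE  (identity of "&")
res_any  <- any(logical())   # FALSE (identity of "|")
res_sum  <- sum(numeric())   # 0     (identity of "+")
res_prod <- prod(numeric())  # 1     (identity of "*")

# The convention keeps reductions consistent when a vector is split in two:
x <- c(TRUE, FALSE, TRUE)
consistent <- identical(all(x), all(x) && all(logical()))
```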
The documentation says
vapply is similar to sapply, but has a pre-specified type of return value, so it can be safer [...] to use.
Could you please elaborate as to why it is generally safer, maybe providing examples?
P.S.: I know the answer and I already tend to avoid sapply. I just wish there was a nice answer here on SO so I can point my coworkers to it. Please, no "read the manual" answer.
As has already been noted, vapply does two things:
Slight speed improvement
Improves consistency by providing limited return type checks.
The second point is the greater advantage, as it helps catch errors before they happen and leads to more robust code. This return value checking could be done separately by using sapply followed by stopifnot to make sure that the return values are consistent with what you expected, but vapply is a little easier (if more limited, since custom error checking code could check for values within bounds, etc.).
Here's an example of vapply ensuring your result is as expected. This parallels something I was just working on while PDF scraping, where findD would use a regex to match a pattern in raw text data (e.g. I'd have a list that was split by entity, and a regex to match addresses within each entity. Occasionally the PDF had been converted out-of-order and there would be two addresses for an entity, which caused badness).
> input1 <- list( letters[1:5], letters[3:12], letters[c(5,2,4,7,1)] )
> input2 <- list( letters[1:5], letters[3:12], letters[c(2,5,4,7,15,4)] )
> findD <- function(x) x[x=="d"]
> sapply(input1, findD )
[1] "d" "d" "d"
> sapply(input2, findD )
[[1]]
[1] "d"
[[2]]
[1] "d"
[[3]]
[1] "d" "d"
> vapply(input1, findD, "" )
[1] "d" "d" "d"
> vapply(input2, findD, "" )
Error in vapply(input2, findD, "") : values must be length 1,
but FUN(X[[3]]) result is length 2
Because there are two d's in the third element of input2, vapply produces an error. But sapply changes the class of the output from a character vector to a list, which could break code downstream.
As I tell my students, part of becoming a programmer is changing your mindset from "errors are annoying" to "errors are my friend."
Zero length inputs
One related point is that if the input length is zero, sapply will always return an empty list, regardless of the input type. Compare:
sapply(1:5, identity)
## [1] 1 2 3 4 5
sapply(integer(), identity)
## list()
vapply(1:5, identity, integer(1))
## [1] 1 2 3 4 5
vapply(integer(), identity, integer(1))
## integer(0)
With vapply, you are guaranteed to have a particular type of output, so you don't need to write extra checks for zero length inputs.
Benchmarks
vapply can be a bit faster because it already knows what format it should be expecting the results in.
input1.long <- rep(input1,10000)
library(microbenchmark)
m <- microbenchmark(
sapply(input1.long, findD ),
vapply(input1.long, findD, "" )
)
library(ggplot2)
library(taRifx) # autoplot.microbenchmark is moving to the microbenchmark package in the next release so this should be unnecessary soon
autoplot(m)
The extra key strokes involved with vapply could save you time debugging confusing results later. If the function you're calling can return different datatypes, vapply should certainly be used.
One example that comes to mind would be sqlQuery in the RODBC package. If there's an error executing a query, this function returns a character vector with the message. So, for example, say you're trying to iterate over a vector of table names tnames and select the max value from the numeric column 'NumCol' in each table with:
sapply(tnames,
function(tname) sqlQuery(cnxn, paste("SELECT MAX(NumCol) FROM", tname))[[1]])
If all the table names are valid, this would result in a numeric vector. But if one of the table names happens to change in the database and the query fails, the results are going to be coerced into mode character. Using vapply with FUN.VALUE=numeric(1), however, will stop the error here and prevent it from popping up somewhere down the line---or worse, not at all.
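The same failure mode can be reproduced without a database. The sketch below uses a hypothetical stand-in, fake_query, which mimics the described behaviour of returning a number on success and a character message on failure; it is not the RODBC API:

```r
# Hypothetical stand-in for a query function (an assumption for illustration):
# returns a numeric value on success, a character message on failure.
fake_query <- function(tname) {
  if (tname == "missing_table") "ERROR: table not found" else 42
}

tnames <- c("t1", "missing_table")

# sapply silently coerces the whole result to character.
res_sapply <- sapply(tnames, fake_query)
class_sapply <- class(res_sapply)

# vapply refuses the character result and stops immediately.
res_vapply <- tryCatch(
  vapply(tnames, fake_query, FUN.VALUE = numeric(1)),
  error = function(e) "vapply stopped"
)
```

With sapply the numeric 42 comes back as the string "42", so the problem only surfaces later, if at all; with vapply the error is raised at the call site.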
If you always want your result to be of a particular type, e.g. a logical vector, vapply makes sure this happens, but sapply does not necessarily do so.
a <- vapply(NULL, is.factor, FUN.VALUE = logical(1))
b <- sapply(NULL, is.factor)
is.logical(a)
# [1] TRUE
is.logical(b)
# [1] FALSE
Suppose I have a list as follows
foo=list(bar="hello world")
I would like to check whether my list has a particular key.
I observe foo$bar2 will return NULL for any bar2 that is not equal to bar, so I figured I could check for whether the return value was null, but this does not seem to work:
if (foo$bar2==NULL) 1 # do something here
This gives the error:
Error in if (foo$bar2 == NULL) 1 : argument is of length zero
I then tried whether NULL is equivalent to false, like in C:
if (foo$bar2) 1 # do something here
This gives the same error.
I now have two questions. How can I check whether the list contains the key?
And how do I check whether an expression is null?
What other languages call "keys" are called "names" in R.
if ("bar" %in% names(foo) ) { print("it's there") } # ....
They are stored in a special attribute, the names attribute (which dput displays as .Names in older versions of R), and extracted with the names function:
dput(foo)
#structure(list(bar = "hello world"), .Names = "bar")
I offer a semantic caution here, because of a common source of confusion due to two distinct uses of the word "names" in R. There are .Names-attributes, but there is an entirely different use of the word name in R that has to do with strings or tokens that have values independent of any inspection or extraction functions like $ or [. Any token that starts with a letter or a period and has no other special characters in it can be a valid name. One can test for it with the function exists, given a quoted version of its name:
exists("foo") # TRUE
# assume 'foo' is a list with a named element "bar"
exists("bar") # [1] FALSE (even though it's a "name")
exists(foo$bar) # [1] FALSE
exists("foo$bar") # [1] FALSE
So the word name has two different meanings in R and you will need to be aware of this ambiguity to understand how the language is deployed. The .Names meaning refers to an attribute with special purposes, while the names-meaning refers to what is called a "language-object". The word symbol is a synonym for this second meaning of the word.
is.name( quote(foo) ) #[1] TRUE
To then show how your second question about testing for nullity might flow into this :
if( !is.null(foo$bar) ) { print("it's there") } # TRUE only when foo$bar is not NULL
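The two tests above are not quite interchangeable. A minimal sketch of where they differ, assuming a list that holds a NULL element (the element name empty and the helper has_key are made up for illustration):

```r
foo <- list(bar = "hello world", empty = NULL)

# Hypothetical helper: test for a key by exact name.
has_key <- function(lst, key) key %in% names(lst)

key_present   <- has_key(foo, "bar")      # TRUE
key_absent    <- has_key(foo, "bar2")     # FALSE

# An element that exists but holds NULL: the name test says it is there,
# the is.null() test says it is not.
empty_by_name <- has_key(foo, "empty")    # TRUE
empty_by_null <- !is.null(foo$empty)      # FALSE

# `$` does partial matching on list names, `[[` with a string does not.
partial_dollar  <- !is.null(foo$b)        # TRUE: "b" partially matches "bar"
partial_bracket <- !is.null(foo[["b"]])   # FALSE: exact matching
```

So %in% names() answers "is there an element with this name?", while !is.null() answers "does extraction give me something?", and with $ the latter is also subject to partial matching.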
In the console, go ahead and try
> sum(sapply(1:99999, function(x) { x != as.character(x) }))
0
For all values 1 through 99999, "1" == 1, "2" == 2, ..., 99999 == "99999" are TRUE. However,
> 100000 == "100000"
FALSE
Why does R have this quirky behavior, and is this a bug? What would be a workaround to, e.g., check if every element in an atomic character vector is in fact numeric? Right now I was trying to check whether x == as.numeric(x) for each x, but that fails on certain datasets due to the above problem!
Have a look at as.character(100000). Its value is not equal to "100000" (have a look for yourself), and R is essentially just telling you so.
as.character(100000)
# [1] "1e+05"
Here, from ?Comparison, are R's rules for applying relational operators to values of different types:
If the two arguments are atomic vectors of different types, one is
coerced to the type of the other, the (decreasing) order of
precedence being character, complex, numeric, integer, logical and
raw.
Those rules mean that when you test whether 1=="1", say, R first converts the numeric value on the LHS to a character string, and then tests for equality of the character strings on the LHS and RHS. In some cases those will be equal, but in other cases they will not. Which cases produce inequality depends on the current settings of options("scipen") and options("digits").
So, when you type 100000=="100000", it is as if you were actually performing the following test. (Note that internally, R may well/probably does use something different than as.character() to perform the conversion):
as.character(100000)=="100000"
# [1] FALSE
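For the workaround asked about in the question, one possibility is to convert in the other direction, from character to numeric, so that no number is ever formatted as a string. This is a sketch rather than a definitive answer; the helper name is_numeric_string is hypothetical, and it treats anything as.numeric() cannot parse (including the string "NA") as non-numeric:

```r
# Hypothetical helper: TRUE for each string that as.numeric() can parse.
is_numeric_string <- function(x) {
  !is.na(suppressWarnings(as.numeric(x)))
}

x <- c("1", "100000", "2.5", "abc")
is_num <- is_numeric_string(x)   # TRUE TRUE TRUE FALSE

# Comparing as numbers avoids the scientific-notation formatting issue.
same_as_number <- as.numeric("100000") == 100000   # TRUE

# If a string comparison is really needed, format() can suppress
# scientific notation explicitly.
same_as_string <- format(100000, scientific = FALSE) == "100000"   # TRUE
```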
I have been using the R which function to remove rows from a data frame. I recently discovered that if the search term is NOT in the data.frame, the result is an empty character.
# 1: returns A-Q, S-Z (as expected)
LETTERS[-which(LETTERS == "R")]
# 2: returns "character(0)" (not what I would expect)
LETTERS[-which(LETTERS == "1")]
# 3: returns A-Z (expected)
LETTERS[which(LETTERS != "1")]
# 4: returns A-Q, S-Z (expected)
LETTERS[which(LETTERS != "R")]
Is the second example the expected behavior for -which() when the search term is not found? I have already switched my code to use the syntax in example 4, which seems safer, but I am just curious.
That is a well-known pitfall. When nothing matches the logical test, the which function returns integer(0), and "[" then returns nothing instead of returning everything, which is what one would expect. You can use:
LETTERS[ ! LETTERS == "1" ]
LETTERS[ ! LETTERS %in% "1" ]
There is another gotcha to be aware of, and it is the one that makes me choose to use which(). With logical indexing, an NA value used inside "[" will return a row. I generally do not want that, so I use DFRM[ which(logical), ], although this seems to bother some people who say it is not needed. I just think they are working with small datasets and infrequently encounter the annoyance of seeing tens of thousands of NA-induced useless lines of output on their console. I never use the negated which version, though.
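The NA behaviour described above is easy to see on a small vector. A minimal illustration with made-up data:

```r
x <- c("a", NA, "b")

# The comparison yields FALSE, NA, TRUE; the NA index produces an NA element.
res_logical <- x[x != "a"]          # NA "b"

# which() keeps only the positions that are TRUE, dropping the NA.
res_which <- x[which(x != "a")]     # "b"
```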
Because of this:
which(LETTERS == '-1')
## integer(0)
and this:
(1:2)[integer(0)]
## integer(0)
Instead of #4, use this:
LETTERS[LETTERS != "R"]
In example 2, which returns integer(0) (a zero-length integer vector) because no values are TRUE. Negating a zero-length vector (-integer(0)) still gives a zero-length vector. So you're essentially asking "[" for zero elements of LETTERS, and you get character(0) back.
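To make the pitfall concrete, the sketch below compares the negated-which form with two alternatives that behave the same whether or not anything matches. The setdiff() variant is one possible way to keep working with positions; it is offered as an assumption about what might suit index-based code, not as the recommended idiom from the answers above:

```r
idx <- which(LETTERS == "1")                 # integer(0): nothing matches

n_negated <- length(LETTERS[-idx])           # 0: everything is dropped
n_logical <- length(LETTERS[LETTERS != "1"]) # 26: everything is kept

# Position-based alternative: remove the matched positions from the full set.
keep <- setdiff(seq_along(LETTERS), idx)
n_setdiff <- length(LETTERS[keep])           # 26

# Both alternatives also behave as expected when there is a match.
idx_r <- which(LETTERS == "R")
n_setdiff_r <- length(LETTERS[setdiff(seq_along(LETTERS), idx_r)])  # 25
```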