%in% vs '==' when comparing date with date_as_string

I'm confused why %in% and '==' give different results here:
day_string <- '2017-07-20'
day_date <- as.Date(day_string)
day_string == day_date #TRUE
day_string %in% day_date #FALSE
From %in% help:
%in% is currently defined as "%in%" <- function(x, table) match(x, table, nomatch = 0) > 0
So if I understand things correctly, since match coerces date to character (but first to numeric),
day_string %in% day_date
is translated to
match(day_string, as.character(as.numeric(day_date)), nomatch = 0) > 0
However, the help for '==' says it also coerces different types. What does '==' actually do in the example above, and why does it behave differently from %in%?

From ?"==": "If the two arguments are atomic vectors of different types, one is coerced to the type of the other"
So I guess == ends up with two vectors of the same type to compare, while %in% is trying to compare a Date with a character.
However, this only happens with Date vs. character; other mixed-type comparisons work, e.g.
as.character(5) %in% 5
#[1] TRUE
as.factor('abc') %in% 'abc'
#[1] TRUE
5 %in% 5L
#[1] TRUE
In the case of the OP, as @Cath mentions, day_date is first converted to numeric and then to character, so the final comparison is,
as.character(as.numeric(day_date))
#[1] "17367"
as.character(as.numeric(day_date)) %in% day_string
#[1] FALSE
Double Checking,
'17367' %in% as.Date(day_string)
#[1] TRUE

The relational operator "==" is (as noted in ?"==") a generic function whose methods can be defined either directly ("==.class") or through the Ops group generic (Ops.class). Such functions are very likely to have methods for R's base classes such as "Date", and so to work as expected, as is the case for "==" through ?Ops.Date. We can see which generic functions support the "Date" class with methods(class = "Date").
On the other hand, match (and its wrapper "%in%") is not generic and cannot necessarily be expected to account for the "class" attribute of its arguments (even for R's own classes). Where it does account for a class, that is because it was explicitly designed to do so, and the fact may be documented on the respective help page. This is the case (though it has not always been), for example, with the "POSIXlt" class (day_string %in% as.POSIXlt(day_date) works as desired). So "%in%" ignores the class of "day_date": all it sees is that it has been passed a typeof(day_date) (i.e. unclass(day_date)) and a typeof(day_string), and it makes the appropriate coercions (say, something like as.character.default(day_date)) according to ?match.
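A practical way around the mismatch is to make both operands the same kind of object before calling %in%, so the result does not depend on how match coerces a Date. The following is a minimal sketch using the question's own variables; dates_match and strings_match are names introduced here for illustration only.

```r
day_string <- "2017-07-20"
day_date <- as.Date(day_string)

# Option 1: parse the string, so both operands are Date objects
dates_match <- as.Date(day_string) %in% day_date

# Option 2: format the Date, so both operands are character strings
# (format() on a Date gives the ISO "YYYY-MM-DD" form by default)
strings_match <- day_string %in% format(day_date)

dates_match    # TRUE
strings_match  # TRUE
```

Either option makes the intent explicit, which is probably preferable to relying on implicit coercion in both == and %in%.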

Related

Why does empty logical vector pass the stopifnot() check?

Today I found that some of my stopifnot() tests are failing because the passed arguments evaluate to empty logical vectors.
Here is an example:
stopifnot(iris$nosuchcolumn == 2) # passes without error
This is very unintuitive and seems to contradict a few other behaviours. Consider:
isTRUE(logical())
> FALSE
stopifnot(logical())
# passes
So stopifnot() passes even when this argument is not TRUE.
But furthermore, the behaviour of the above is different with different types of empty vectors.
isTRUE(numeric())
> FALSE
stopifnot(numeric())
# Error: numeric() are not all TRUE
Is there some logic to the above, or should this be considered a bug?

The comments by akrun and r2evans are spot on.
However, to give details on why specifically this happens and why you're confused vs. isTRUE() behavior, note that stopifnot() checks for three things; the check is (where r is the result of the expression you pass):
if (!(is.logical(r) && !anyNA(r) && all(r)))
So, let's take a look:
is.logical(logical())
# [1] TRUE
!anyNA(logical())
# [1] TRUE
all(logical())
# [1] TRUE
is.logical(numeric())
# [1] FALSE
!anyNA(numeric())
# [1] TRUE
all(numeric())
# [1] TRUE
So, the only reason why logical() passes while numeric() fails is because numeric() is not "logical," as suggested by akrun. For this reason, you should avoid checks that may result in logical vectors of length 0, as suggested by r2evans.
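If an empty result should count as a failure, one option is to add a length check to the same three conditions. This is only a sketch: stop_if_not_strict is a hypothetical helper written for this example, not part of base R, and df is a made-up data frame.

```r
# Hypothetical helper: the same three checks as stopifnot(),
# plus a rejection of zero-length results
stop_if_not_strict <- function(cond) {
  if (!is.logical(cond) || length(cond) == 0L || anyNA(cond) || !all(cond)) {
    stop("condition is not a non-empty, all-TRUE logical vector", call. = FALSE)
  }
  invisible(TRUE)
}

df <- data.frame(a = 1:3)

# A genuine all-TRUE condition passes
ok <- stop_if_not_strict(df$a > 0)

# A missing column gives NULL == 2, i.e. logical(0), which is now an error
empty_result <- tryCatch(
  stop_if_not_strict(df$nosuchcolumn == 2),
  error = function(e) "error"
)
empty_result  # "error"
```

Writing the check as stopifnot(length(x) > 0, x == 2) would achieve much the same thing without a helper.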

Other answers cover the practical reasons why stopifnot behaves the way it does; but I agree with Karolis that the thread linked by Henrik adds the real explanation of why this is the case:
As author stopifnot(), I do agree with [OP]'s "gut feeling" [...] that
stopifnot(dim(x) == c(3,4)) [...][should] stop in the case
where x is a simple vector instead of a matrix/data.frame/... with
dimensions c(3,4) ... but [...] the gut feeling is wrong because of the fundamental lemma of logic: [...]
"All statements about elements of the empty set are true"
Martin Maechler, ETH Zurich
Also, [...], any() is to "|" what sum() is to "+" and what all() is to
"&" and prod() is to "*". All the operators have an identity element,
namely FALSE, 0, TRUE, and 1 respectively, and the generic convention
is that for an empty vector, we return the identity element, for the
reason given above.
Peter D.
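The identity-element convention described in the quotes can be checked directly in the console. A small sketch (identity_elements is just a name chosen for this example):

```r
# Each reduction applied to an empty vector returns the identity
# element of its underlying operator
identity_elements <- list(
  all  = all(logical()),   # identity of "&" is TRUE
  any  = any(logical()),   # identity of "|" is FALSE
  sum  = sum(numeric()),   # identity of "+" is 0
  prod = prod(numeric())   # identity of "*" is 1
)
str(identity_elements)
```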

How to specify "does not contain" in dplyr filter

I am quite new to R.
Using the table called SE_CSVLinelist_clean, I want to extract the rows where the variable where_case_travelled_1 DOES NOT contain the strings "Outside Canada" OR "Outside province/territory of residence but within Canada", and then create a new table called SE_CSVLinelist_filtered.
SE_CSVLinelist_filtered <- filter(SE_CSVLinelist_clean,
where_case_travelled_1 %in% -c('Outside Canada','Outside province/territory of residence but within Canada'))
The code above works when I just use "c" and not "-c".
So, how do I specify the above when I really want to exclude the rows that contain travel outside of the country or province?

Note that %in% returns a logical vector of TRUE and FALSE. To negate it, you can use ! in front of the logical statement:
SE_CSVLinelist_filtered <- filter(SE_CSVLinelist_clean,
!where_case_travelled_1 %in%
c('Outside Canada','Outside province/territory of residence but within Canada'))
Regarding your original approach with -c(...), - is a unary operator that "performs arithmetic on numeric or complex vectors (or objects which can be coerced to them)" (from help("-")). Since you are dealing with a character vector that cannot be coerced to numeric or complex, you cannot use -.
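To see what the negated condition evaluates to, it can help to look at the logical vector on its own, outside of filter(). The sketch below uses a made-up character vector (travel) in place of the real column; note that %in% binds more tightly than !, so !x %in% y is read as !(x %in% y).

```r
# Toy stand-in for the where_case_travelled_1 column (assumed values)
travel <- c("Outside Canada", "Within province", NA,
            "Outside province/territory of residence but within Canada")
excluded <- c("Outside Canada",
              "Outside province/territory of residence but within Canada")

# %in% never returns NA: a missing value simply does not match,
# so after negation the NA row is kept
keep <- !travel %in% excluded
keep          # FALSE TRUE TRUE FALSE
travel[keep]  # "Within province" NA
```

Whether keeping the NA rows is desirable depends on the data; adding !is.na(where_case_travelled_1) as a second condition would drop them.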

Try putting the search condition in a bracket, as shown below. This returns the result of the conditional query inside the bracket. Then test its result to determine if it is negative (i.e. it does not belong to any of the options in the vector), by setting it to FALSE.
SE_CSVLinelist_filtered <- filter(SE_CSVLinelist_clean,
(where_case_travelled_1 %in% c('Outside Canada','Outside province/territory of residence but within Canada')) == FALSE)

Just be careful with the previous solutions, since they require you to type out EXACTLY the string you are trying to detect.
Ask yourself if the word "Outside", for example, is sufficient. If so, then:
library(dplyr)
library(stringr)
data_filtered <- data %>%
  filter(!str_detect(where_case_travelled_1, "Outside"))
A reprex version:
iris
iris %>%
filter(!str_detect(Species, "versicolor"))
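The same partial-match idea can be tried without any packages. This sketch swaps in base R's grepl() for str_detect(), which is a substitution made here and not what the answer above uses; travel is a made-up vector standing in for the real column.

```r
# Toy stand-in for the where_case_travelled_1 column (assumed values)
travel <- c("Outside Canada", "Within province",
            "Outside province/territory of residence but within Canada")

# fixed = TRUE treats "Outside" as a literal substring, not a regex
no_outside <- !grepl("Outside", travel, fixed = TRUE)

travel[no_outside]  # "Within province"
```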

Quick fix. First define the opposite of %in%:
'%ni%' <- Negate("%in%")
Then apply:
SE_CSVLinelist_filtered <- filter(
SE_CSVLinelist_clean,
where_case_travelled_1 %ni% c('Outside Canada',
'Outside province/territory of residence but within Canada'))

Why does 1..99,999 == "1".."99,999" in R, but 100,000 != "100,000"?

In the console, go ahead and try
> sum(sapply(1:99999, function(x) { x != as.character(x) }))
0
For all of values 1 through 99999, "1" == 1, "2" == 2, ..., 99999 == "99999" are TRUE. However,
> 100000 == "100000"
FALSE
Why does R have this quirky behavior, and is this a bug? What would be a workaround to, e.g., check if every element in an atomic character vector is in fact numeric? Right now I was trying to check whether x == as.numeric(x) for each x, but that fails on certain datasets due to the above problem!

Have a look at as.character(100000). Its value is not equal to "100000" (have a look for yourself), and R is essentially just telling you so.
as.character(100000)
# [1] "1e+05"
Here, from ?Comparison, are R's rules for applying relational operators to values of different types:
If the two arguments are atomic vectors of different types, one is
coerced to the type of the other, the (decreasing) order of
precedence being character, complex, numeric, integer, logical and
raw.
Those rules mean that when you test whether 1=="1", say, R first converts the numeric value on the LHS to a character string, and then tests for equality of the character strings on the LHS and RHS. In some cases those will be equal, but in other cases they will not. Which cases produce inequality depends on the current settings of options("scipen") and options("digits").
So, when you type 100000=="100000", it is as if you were actually performing the following test (note that internally R may well use something other than as.character() to perform the conversion):
as.character(100000)=="100000"
# [1] FALSE
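For the workaround the question asks about, one possibility is to push the comparison in the other direction: convert the strings to numbers instead of letting R convert the numbers to strings. A minimal sketch, with x as a made-up input vector and the other names introduced only for this example:

```r
x <- c("1", "99999", "100000", "abc")

# as.numeric() yields NA (with a warning) for strings it cannot parse,
# so non-NA results mark the strings that look numeric
is_numeric_string <- !is.na(suppressWarnings(as.numeric(x)))
is_numeric_string  # TRUE TRUE TRUE FALSE

# Comparing as numbers avoids the "1e+05" formatting issue entirely
same_number <- as.numeric("100000") == 100000
same_number  # TRUE

# If a string is really needed, ask for fixed notation explicitly
plain <- format(100000, scientific = FALSE)
plain  # "100000"
```

This treats a literal "NA" string as non-numeric, which may or may not be what a given dataset needs.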

Unexpected behavior using -which() in R when the search term is not found

I have been using the R which function to remove rows from a data frame. I recently discovered that if the search term is NOT in the data.frame, the result is an empty character vector.
# 1: returns A-Q, S-Z (as expected)
LETTERS[-which(LETTERS == "R")]
# 2: returns "character(0)" (not what I would expect)
LETTERS[-which(LETTERS == "1")]
# 3: returns A-Z (expected)
LETTERS[which(LETTERS != "1")]
# 4: returns A-Q, S-Z (expected)
LETTERS[which(LETTERS != "R")]
Is the second example the expected behavior for -which() when the search term is not found? I have already switched my code to use the syntax in example 4, which seems safer, but I am just curious.

That is a well-known pitfall. When nothing matches the logical test, the which function returns integer(0), and then "[" returns nothing instead of returning everything, which is what one would expect. You can use:
LETTERS[ ! LETTERS == "1" ]
LETTERS[ ! LETTERS %in% "1" ]
There is another gotcha to be aware of, and it is the one that makes me choose to use which(). With logical indexing, an NA value used inside "[" will return a row. I generally do not want that, so I use DFRM[ which(logical) ], although this seems to bother some people who say it is not needed. I just think they are working with small datasets and infrequently encounter the annoyance of seeing tens of thousands of NA-induced useless lines of output on their console. I never use the negated which version though.
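The NA behaviour is easy to reproduce on a plain vector. A small sketch with a made-up vector x:

```r
x <- c("a", NA, "b")

# The comparison yields TRUE NA FALSE, and an NA index returns an NA element
with_logical <- x[x == "a"]
with_logical  # "a" NA

# which() keeps only the positions that are TRUE, dropping the NA
with_which <- x[which(x == "a")]
with_which    # "a"
```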

Because of this:
which(LETTERS == '-1')
## integer(0)
and this:
(1:2)[integer(0)]
## integer(0)
Instead of #4, use this:
LETTERS[LETTERS != "R"]

In example 2, which returns integer(0) (a zero-length integer vector) because no values are TRUE. A negative zero-length vector (-integer(0)) is still a zero-length vector. So you're essentially asking for the NULL element of LETTERS, which doesn't exist.
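If the index form is still wanted (for example because the positions are reused elsewhere), the empty case can be handled explicitly. This is a sketch of three possible guards; idx and the kept_* names are introduced here for illustration.

```r
idx <- which(LETTERS == "1")  # integer(0): nothing matched

# The pitfall: negating an empty index still selects nothing
dropped_everything <- LETTERS[-idx]
length(dropped_everything)  # 0

# Guard 1: only negate when there is something to remove
kept_guarded <- if (length(idx) > 0L) LETTERS[-idx] else LETTERS

# Guard 2: compute the positions to keep, which is never ambiguous
kept_setdiff <- LETTERS[setdiff(seq_along(LETTERS), idx)]

# Guard 3: skip which() and negate the logical vector
kept_logical <- LETTERS[!LETTERS %in% "1"]

length(kept_guarded)  # 26
```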

What do the %op% operators mean in R? For example "%in%"?

I tried to do this simple search but couldn't find anything on the percent (%) symbol in R.
What does %in% mean in the following code?
time(x) %in% time(y) where x and y are matrices.
How do I look up help on %in% and similar functions that follow the %stuff% pattern, as I cannot locate the help file?
Related questions:
What does eg %+% do? in R
The R %*% operator
What does %*% mean in R
What does %||% do in R?
What does %>% mean in R

I didn't think GSee's or Sathish's answers went far enough because "%" does have meaning all by itself and not just in the context of the %in% operator. It is the mechanism for defining new infix operators by users. It is a much more general issue than the virtues of the %in% infix operator or its more general prefix ancestor match. It could be as simple as making a pairwise "s"(um) operator:
`%s%` <- function(x,y) x + y
Or it could be more interesting, say making a second derivative operator:
`%DD%` <- function(expr, nam="x") { D(D( bquote(.(expr)), nam), nam) }
expression(x^4) %DD% "x"
# 4 * (3 * x^2)
The %-character also has importance in the parsing of Date, date-time, and C-type format functions like strptime, formatC and sprintf.
Since that was originally written we have seen the emergence of the magrittr package with the dplyr elaboration that demonstrates yet another use for %-flanked operators.
So the most general answer is that % symbols are handled specially by the R parser. Since the parser is used to process plotmath expressions, you will also see extensive options for graphics annotations at the ?plotmath help page.

%op% denotes an infix binary operator. There are several built-in operators using %, and you can also create your own.
(A single % sign isn't a keyword in R. You can see a list of keywords on the ?Reserved help page.)
How do I get help on binary operators?
As with anything that isn't a standard variable name, you have to enclose the term in quotes or backquotes.
?"%in%"
?`%in%`
Credit: GSee's answer.
What does %in% do?
As described on the ?`%in%` help page (which is actually the ?match help page, since %in% is really only an infix version of match),
[%in%] returns a logical vector indicating if there is a match or not for its left operand
It is most commonly used with categorical variables, though it can be used with numbers as well.
c("a", "A") %in% letters
## [1] TRUE FALSE
1:4 %in% c(2, 3, 5, 7, 11)
## [1] FALSE TRUE TRUE FALSE
Credit: GSee's answer, Ari's answer, Sathish's answer.
How do I create my own infix binary operators?
These are functions, and can be defined in the same way as any other function, with a couple of restrictions.
It's a binary operator, so the function must take exactly two arguments.
Since the name is non-standard, it must be written with quotes or backquotes.
For example, this defines a matrix power operator.
`%^%` <- function(x, y) matrixcalc::matrix.power(x, y)
matrix(1:4, 2) %^% 3
Credit: BondedDust's answer, Ari's answer.
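For a definition that needs no extra package, here is a sketch of a "default value" operator in the spirit of the %||% operator listed further down. The name %or% is hypothetical, chosen for this example so that it does not clash with any existing operator.

```r
# Hypothetical operator: return the left operand unless it is NULL
`%or%` <- function(x, y) if (is.null(x)) y else x

from_null  <- NULL %or% "default"
from_value <- "value" %or% "default"

from_null   # "default"
from_value  # "value"

# An infix operator is an ordinary function, so prefix calls work too
prefix_call <- `%or%`(NULL, 1)
prefix_call  # 1
```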
What other % operators are there?
In base R:
%/% and %% perform integer division and modular division respectively, and are described on the ?Arithmetic help page.
%o% gives the outer product of arrays.
%*% performs matrix multiplication.
%x% performs the Kronecker product of arrays.
In ggplot2:
%+% replaces the data frame in a ggplot.
%+replace% modifies theme elements in a ggplot.
%inside% (internal) checks for values in a range.
%||% (internal) provides a default value in case of NULL values. This function also appears internally in devtools, reshape2, roxygen2 and knitr. (In knitr it is called %n%.)
In magrittr:
%>% pipes the left-hand side into an expression on the right-hand side.
%<>% pipes the left-hand side into an expression on the right-hand side, and then assigns the result back into the left-hand side object.
%T>% pipes the left-hand side into an expression on the right-hand side, which it uses only for its side effects, returning the left-hand side.
%,% builds a functional sequence.
%$% exposes columns of a data.frame or members of a list.
In data.table:
%between% checks for values in a range.
%chin% is like %in%, optimised for character vectors.
%like% checks for regular expression matches.
In Hmisc:
%nin% returns the opposite of %in%.
In devtools:
%:::% (internal) gets a variable from a namespace passed as a string.
In sp:
%over% performs a spatial join (e.g., which polygon corresponds to some points?)
In rebus:
%R% concatenates elements of a regex object.
More generally, you can find all the operators in all the packages installed on your machine using:
library(magrittr)
ip <- installed.packages() %>% rownames
(ops <- setNames(ip, ip) %>%
  lapply(
    function(pkg)
    {
      rdx_file <- system.file("R", paste0(pkg, ".rdx"), package = pkg)
      if(file.exists(rdx_file))
      {
        rdx <- readRDS(rdx_file)
        fn_names <- names(rdx$variables)
        fn_names[grepl("^%", fn_names)]
      }
    }
  ) %>%
  unlist
)
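For a quicker look at a single environment, ls() with a pattern may be enough. This sketch lists only the %-operators in base R and makes no attempt to cover other packages; base_ops is a name introduced here.

```r
# List every object in the base environment whose name starts with "%"
base_ops <- ls(baseenv(), pattern = "^%")
base_ops
```

The exact list depends on the R version, so no expected output is shown; it should include at least %in%, %o% and %*%.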

Put quotes around it to find the help page. Either of these works:
> help("%in%")
> ?"%in%"
Once you get to the help page, you'll see that
‘%in%’ is currently defined as
‘"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0’
Since time is a generic, I don't know what time(X2) returns without knowing what X2 is. But, %in% tells you which items from the left hand side are also in the right hand side.
> c(1:5) %in% c(3:8)
[1] FALSE FALSE TRUE TRUE TRUE
See also, intersect
> intersect(c(1:5), c(3:8))
[1] 3 4 5
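The relationship between match, %in% and intersect can be traced step by step with the same two vectors. A minimal sketch; x, tbl and the result names are introduced for this example.

```r
x   <- 1:5
tbl <- 3:8

# match() gives the position in tbl of each element of x, or NA if absent
positions <- match(x, tbl)
positions  # NA NA 1 2 3

# The quoted definition of %in%: use 0 for "no match", then test > 0
by_hand <- match(x, tbl, nomatch = 0) > 0
by_hand    # FALSE FALSE TRUE TRUE TRUE

# Subsetting with %in% gives the same values as intersect() here
common <- x[x %in% tbl]
common     # 3 4 5
```

The two differ when x has repeated values: intersect() returns each common value once, while x[x %in% tbl] keeps the repeats.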

More generally, %foo% is the syntax for a binary operator. Binary operators in R are really just functions in disguise, and take two arguments (the one before and the one after the operator become the first two arguments of the function).
For example:
> `%in%`(1:5,4:6)
[1] FALSE FALSE FALSE TRUE TRUE
While %in% is defined in base R, you can also define your own binary function:
`%hi%` <- function(x,y) cat(x,y,"\n")
> "oh" %hi% "my"
oh my

%in% tests, for each element of its left operand, whether it occurs in the right operand. This makes it useful for finding and subsetting multiple occurrences of the same name or value in a vector, matrix or data frame.
Example 1: Subsetting with the same name
set.seed(133)
x <- runif(5)
names(x) <- letters[1:5]
x[c("a", "d")]
# a d
# 0.5360112 0.4231022
Now you change the name of "d" to "a"
names(x)[4] <- "a"
If you try to extract both same-named elements using the previous kind of subscript, it will not work. Notice that the result repeats element [1] and does not include element [4].
x[c("a", "a")]
# a a
# 0.5360112 0.5360112
So, you can extract the two "a"s from their different positions in the vector by using the %in% binary operator.
names(x) %in% "a"
# [1] TRUE FALSE FALSE TRUE FALSE
#assign it to a variable called "vec"
vec <- names(x) %in% "a"
#extract the values of two "a"s
x[vec]
# a a
# 0.5360112 0.4231022
Example 2: Subsetting multiple values from a column
Refer this site for an example
