I encountered an unexpected output when I used %in% in a condition whilst recoding a categorical variable.
When an element of a vector on the left is NA, the condition evaluates as FALSE, whilst I expected it to be NA.
The behaviour I expected is that of the more verbose statement with two == conditions joined by |:
dt <- data.frame(colour = c("red", "orange", "blue", NA))
# Expected
dt$is_warm1 <- ifelse(dt$colour == "red" | dt$colour == "orange", TRUE, FALSE)
# Unexpected
dt$is_warm2 <- ifelse(dt$colour %in% c("red", "orange"), TRUE, FALSE)
dt
#> colour is_warm1 is_warm2
#> 1 red TRUE TRUE
#> 2 orange TRUE TRUE
#> 3 blue FALSE FALSE
#> 4 <NA> NA FALSE
This is quite unhelpful when recoding categorical variables because it silently fills missing values. Why does this happen, and are there any alternatives that don't involve listing all the == conditions? (Imagine that colour contains thirty possible levels).
a %in% b is just shorthand for match(a, b, nomatch = 0) > 0 (check the source code for %in% to satisfy yourself that this is the case).
You can get NA for the missing value by removing the nomatch = 0 argument, although non-matching values such as "blue" then also come back as NA rather than FALSE:
match(dt$colour, c("red", "orange")) > 0
#> [1] TRUE TRUE NA NA
Neither form needs the ifelse wrapper, since the comparison already returns a logical vector.
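The equivalence is easy to check. Here is a minimal sketch using a plain vector with the same values as dt$colour (the variable names are only illustrative):

```r
colour <- c("red", "orange", "blue", NA)
warm <- c("red", "orange")

# %in% and its match() definition agree: NA and "blue" both give FALSE
via_in <- colour %in% warm
via_match <- match(colour, warm, nomatch = 0L) > 0L
identical(via_in, via_match)
#> [1] TRUE

# Without nomatch = 0, every non-match (missing or not) stays NA
no_nomatch <- match(colour, warm) > 0L
no_nomatch
#> [1] TRUE TRUE   NA   NA
```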
%in% treats NA as a value to look up in the list. Consider these two scenarios:
NA %in% 1:3
# [1] FALSE
NA %in% c(1:3, NA)
# [1] TRUE
This allows you to check whether NA is in the vector or not.
If you want to preserve NA values, you could write your own alternative
`%nain%` <- function(val, list) {
ifelse(is.na(val), NA, val %in% list)
}
And then you can use
dt$is_warm3 <- dt$colour %nain% c("red", "orange")
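For reference, here is a sketch of what this produces on the question's values. It uses a variant of %nain% that avoids ifelse by overwriting the positions where the left-hand side is missing. That variant is an assumption about style, not part of the answer above.

```r
# Variant of %nain%: run %in% first, then restore NA where val is missing
`%nain%` <- function(val, list) {
  res <- val %in% list
  res[is.na(val)] <- NA
  res
}

colour <- c("red", "orange", "blue", NA)
is_warm3 <- colour %nain% c("red", "orange")
is_warm3
#> [1]  TRUE  TRUE FALSE    NA
```

Unlike match() without nomatch, "blue" stays FALSE here and only the genuinely missing value becomes NA.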
Here is some info from the help documentation ?%in%, quoted below.
As the last line of the quote says, %in% never returns NA, which is why you get FALSE and not NA. It treats the missing value as something to be matched, as @MrFlick mentioned in his answer:
Exactly what matches what is to some extent a matter of definition.
For all types, NA matches NA and no other value. For real and complex
values, NaN values are regarded as matching any other NaN value, but
not matching NA, where for complex x, real and imaginary parts must
match both (unless containing at least one NA).
Character strings will be compared as byte sequences if any input is
marked as "bytes", and otherwise are regarded as equal if they are in
different encodings but would agree when translated to UTF-8 (see
Encoding).
That %in% never returns NA makes it particularly useful in if
conditions.
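A small sketch of why this matters in if conditions (the values are made up for illustration): comparing a missing value with == yields NA, which if() rejects with an error, whereas %in% yields FALSE.

```r
x <- NA_character_

# == gives NA, and if (NA) is an error, caught here with tryCatch
res_eq <- tryCatch(
  if (x == "red") "warm" else "other",
  error = function(e) "error"
)
res_eq
#> [1] "error"

# %in% gives FALSE, so the else branch runs
res_in <- if (x %in% "red") "warm" else "other"
res_in
#> [1] "other"
```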
I have the following:
> v1 <- c(T, F, T, T, F)
> table(v1)
v1
FALSE TRUE
2 3
To index the 'True' column, I do this:
> table(v1)[2]
TRUE
3
However, if a logical vector contains only FALSE values, the table will have only one column and the previous strategy no longer works to retrieve the TRUE column:
> v2 <- c(F, F, F, F, F)
> table(v2)[2]
<NA>
NA
How can one consistently index the TRUE column regardless of whether its count is zero? One solution is to do this:
> table(factor(v2, levels= c("FALSE", "TRUE")))[2]
TRUE
0
But this feels like cheating because it treats TRUE and FALSE as characters that become levels of a factor. For non-logical vectors, this behaviour is understandable, because there is no way of knowing what levels exist. (1) Is there a way to force table() to take into consideration the fact that logical vectors only take on two values and always present two columns for them? (2) Am I overthinking this and the last command is an acceptable and robust practice?
Convert to a factor with the levels specified so that it always has two levels. Without a TRUE value in the data, table has no way to create a count for TRUE, because that information is not present. With the factor levels given, the TRUE count is 0:
table(factor(v2, levels = c(FALSE, TRUE)))[2]
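As a possible refinement, the count can be looked up by name instead of by position, so the result does not depend on the order of the levels. This is a minimal sketch; the variable names are illustrative.

```r
v2 <- c(FALSE, FALSE, FALSE, FALSE, FALSE)
counts <- table(factor(v2, levels = c(FALSE, TRUE)))

# Index by level name rather than by column position
n_true <- as.integer(counts["TRUE"])
n_false <- as.integer(counts["FALSE"])
n_true
#> [1] 0
n_false
#> [1] 5
```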
It is not clear why the TRUE values of a logical vector need to be counted with table and then extracted by the TRUE/FALSE names. This can be done more easily with sum, since TRUE counts as 1 and FALSE as 0; negating with ! reverses this:
> sum(v1)
[1] 3
> sum(!v1)
[1] 2
> sum(v2)
[1] 0
> sum(!v2)
[1] 5
Because the case of logical is so specific for the requirements, I would write a specific function:
logitable <- function(x)
{
  x <- as.logical(x)
  kNA <- sum(is.na(x))
  kT <- sum(x, na.rm=TRUE)
  kF <- length(x) - kT - kNA
  return (structure(
    c(kT, kF, kNA),
    names = c("TRUE", "FALSE", "NA")
  ))
}
Please note that the returned object is a named vector, not an object of class "table". Let me know if it is important to you that a table object is returned.
Test with:
logitable(c(T,F,T,F,T))
logitable(c(T,T,T,T,T))
logitable(c(F,F,F,F,F))
logitable(c(T,F,T,F,NA))
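If a "table" object is wanted, one possible sketch combines fixed factor levels with useNA = "always". The name logitable2 is hypothetical and this is only one way to do it, not the function above.

```r
# Hypothetical variant of logitable() that returns a "table" object
logitable2 <- function(x) {
  x <- as.logical(x)
  # Fixed levels guarantee both columns; useNA = "always" adds the NA column
  table(factor(x, levels = c(TRUE, FALSE)), useNA = "always")
}

res <- logitable2(c(TRUE, FALSE, TRUE, FALSE, NA))
as.vector(res)   # counts in the order TRUE, FALSE, NA
#> [1] 2 2 1

all_false <- logitable2(c(FALSE, FALSE, FALSE))
as.vector(all_false)
#> [1] 0 3 0
```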
What is the difference between sum(is.na(bollywood), T) and sum(is.na(bollywood))?
I have tried both of these but they are giving different output and I'm not sure of the reason.
The T at the end was presumably meant for the na.rm argument. Because na.rm comes after ... in sum, it is only matched by name, so an unnamed T is treated as one more value to add. It is also better to spell out TRUE instead of T: TRUE cannot be used as an object name, whereas T can be reassigned, and this can lead to some buggy situations. The intended call is
sum(is.na(bollywood), na.rm = TRUE)
Here, there is no need for na.rm, because is.na returns only TRUE/FALSE depending on whether each element is NA. sum then gives the count of TRUE elements (TRUE is 1, FALSE is 0).
Using a small reproducible example
bollywood <- c('a', 'b', NA, 'd', NA)
is.na(bollywood)
#[1] FALSE FALSE TRUE FALSE TRUE
here, there are two NAs, so sum returns 2
sum(is.na(bollywood))
#[1] 2
Now, we define an object T
T <- 5
sum(is.na(bollywood), T)
#[1] 7
So here sum adds the T value of 5 to the count of 2.
Instead, it should be
sum(is.na(bollywood), na.rm = TRUE)
#[1] 2
As noted above, na.rm is not at all needed. If we check the documentation of ?sum, usage is
sum(..., na.rm = FALSE)
i.e. ... can take multiple arguments, so is.na(bollywood) is the first argument, T (the object created above) is the second, and so on.
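The same effect shows up even when T has not been reassigned, which would explain the different outputs in the question. In this minimal sketch on the small example above, an unnamed TRUE is simply added as 1:

```r
bollywood <- c("a", "b", NA, "d", NA)

n_plain <- sum(is.na(bollywood))                # two NAs
n_positional <- sum(is.na(bollywood), TRUE)     # TRUE is summed as 1
n_named <- sum(is.na(bollywood), na.rm = TRUE)  # na.rm matched by name

c(n_plain, n_positional, n_named)
#> [1] 2 3 2
```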
The following is about the R language.
x1 <- c(1, 4, 3, NA, 7)
is.na(x1) <- which(x1 == 7)
I don't understand: the LHS in the last line gives you a logical vector and the RHS is a value (the index where x1 == 7, 5 in this case). So what does it mean to assign a logical vector a value of 5?
From the docs, is.na returns:
The default method for is.na applied to an atomic vector returns a logical vector of the same length as its argument x, containing TRUE for those elements marked NA or, for numeric or complex vectors, NaN, and FALSE otherwise.
The assignment form is different: is.na(x1) <- value calls the replacement function `is.na<-`, which marks the elements of x1 selected by value as NA.
So you are not assigning 5 to a logical vector; you are in essence saying that the element at index 5 should become NA, so that is.na() returns TRUE there.
To put it in practice:
This is the output from is.na(x1):
is.na(x1)
[1] FALSE FALSE FALSE TRUE FALSE
The corresponding output from which(x1 == 7):
which(x1 == 7)
[1] 5
Combining the two, the element at position 5 becomes NA, so is.na() now returns TRUE for it:
is.na(x1) <- which(x1 == 7)
x1
[1] 1 4 3 NA NA
An index beyond the end of the vector extends it. Starting again from the original x1, the following turns the first element into NA and pads the vector with NAs so that index 7 is NA.
This can be best seen by:
is.na(x1) <- c(1,7)
x1
[1] NA 4 3 NA 7 NA NA
Compare with this example from the docs:
(xx <- c(0:4))
is.na(xx) <- c(2, 4)
xx
[1] 0 NA 2 NA 4
From the above, it is clear that c(2, 4) refers to positions in xx, hence the second and fourth elements become NA.
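Put differently, for a plain vector the assignment form of is.na behaves like ordinary subset assignment of NA. This minimal sketch compares the two on the vector from the question (a and b are illustrative names):

```r
x1 <- c(1, 4, 3, NA, 7)

# Replacement-function form
a <- x1
is.na(a) <- which(a == 7)

# Equivalent subset assignment
b <- x1
b[which(b == 7)] <- NA

a
#> [1]  1  4  3 NA NA
identical(a, b)
#> [1] TRUE
```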
The code is like this
x <- 1:5
x[NA]
Why does it produce 5 NAs?
The answer to this question has two sides:
How is NA interpreted when indexing vectors?
In one of the links provided by @alexis_laz, I found a very well-structured explanation of how TRUE, FALSE and NA are interpreted when indexing:
Logical indices tell R which elements to include or exclude.
You have three options: TRUE, FALSE and NA
They serve to indicate whether or not the index represented in that position should be included. In other words:
TRUE == "Include the element at this index"
FALSE == "Do not include the element at this index"
NA == "Return NA instead of this index" # loosely speaking
For example:
x <- 1:6
x[ c(TRUE, FALSE, TRUE, NA, TRUE, FALSE)]
# [1] 1 3 NA 5
An important detail is that the default storage mode for an isolated NA value is logical (try typeof(NA)). You can choose the storage mode of the NA by using NA_integer_, NA_real_ (for double), NA_complex_ or NA_character_.
Why 5 NA and not just 1?
When the length of the indices is smaller than the length of vector x, the indexing will start over to also index the values in x that have not been indexed yet. In other words, R will automatically "recycle" the indices:
(...) However, standard recycling rules apply. So in the previous example, if we drop the last FALSE, the index vector is recycled, the first element of the index is TRUE, and hence the 6th element of x is now included
x <- 1:6
x[c(TRUE, FALSE, TRUE, NA, TRUE)]
# [1] 1 3 NA 5 6
Recall the detail about the storage mode from the previous section. If you type x[NA_integer_], then you will find a different result.
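A minimal sketch of that difference on the vector from the question: the logical NA is recycled to the length of x, while an integer NA is a single (missing) position.

```r
x <- 1:5

# Logical NA: recycled across all five positions
by_logical <- x[NA]
by_logical
#> [1] NA NA NA NA NA

# Integer NA: one unknown position, so one NA
by_integer <- x[NA_integer_]
by_integer
#> [1] NA
```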
Recently, I've run into behaviour of the table function that was not what I expected.
For example, let's take the following vector:
ex_vec <- c("Non", "Non", "Nan", "Oui", "NaN", NA)
If I check for NA values in my vector, "NaN" is not considered one (as expected):
is.na(ex_vec)
# [1] FALSE FALSE FALSE FALSE FALSE TRUE
But if I try to get the frequencies of the different values:
table(ex_vec)
#ex_vec
#Nan Non Oui
# 1 2 1
"NaN" does not appear in the table.
However, if I "ask" table to show the NA values, I get this:
table(ex_vec, useNA="ifany")
#ex_vec
# Nan NaN Non Oui <NA>
# 1 1 2 1 1
So the character string "NaN" is treated as an NA value inside the table call, while being treated as a non-NA value in the output.
I know I could (and it would be better to) solve my problem by converting my vector to a factor, but nonetheless I'd really like to know what's going on here. Does anyone have an idea?
When factor matches up levels for a vector it converts its exclude list to the same type as the input vector:
exclude <- as.vector(exclude, typeof(x))
so if your exclude list has NaN (the default in table is c(NA, NaN)) and your vector is character, this happens:
exclude <- c(NA, NaN)
as.vector(exclude, typeof(letters))
[1] NA "NaN"
Oh dear. Now the real "NaN" strings will be excluded.
To fix, use exclude=NA in table (and factor if you are making factors that might hit this).
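A minimal sketch of the suggested fix on the vector from the question. The count is looked up by name because the order of the levels depends on the locale.

```r
ex_vec <- c("Non", "Non", "Nan", "Oui", "NaN", NA)

# Coerced to character, c(NA, NaN) becomes NA plus the string "NaN"
as.vector(c(NA, NaN), "character")
#> [1] NA    "NaN"

# With exclude = NA only the real missing value is dropped
tab <- table(ex_vec, exclude = NA)
as.integer(tab["NaN"])
#> [1] 1
```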
I do love this in the docs for factor:
There are some anomalies associated with factors that have ‘NA’ as
a level. It is suggested to use them sparingly, e.g., only for
tabulation purposes.
Reassuring...
My first idea was to look at the definition of table, which starts with:
> table
function (..., exclude = if (useNA == "no") c(NA, NaN), useNA = c("no",
"ifany", "always"), dnn = list.names(...), deparse.level = 1)
{
That sounds logical: by default, table excludes NA and NaN.
Digging into the table code, we see that if an argument is not a factor, it is coerced to one (nothing new here, the documentation says so).
else {
a <- factor(a, exclude = exclude)
I didn't find anything else that could have coerced "NaN" into an NA value.
So, looking into factor to find out why, we find the root cause:
> factor
function (x = character(), levels, labels = levels, exclude = NA,
ordered = is.ordered(x), nmax = NA)
{
[...] # Snipped for brevity
exclude <- as.vector(exclude, typeof(x))
x <- as.character(x)
levels <- levels[is.na(match(levels, exclude))] # defined in the snipped part above, is the sorted unique values of input vector, coerced to char.
f <- match(x, levels)
[...]
f
}
Here we have it: the exclude parameter, even when it holds NA values, is coerced into a character vector.
So what happens is:
> ex_vec <- c("Non", "Non", "Nan", "Oui", "NaN", NA)
> excludes<-c(NA,NaN)
> as.vector(excludes,"character")
[1] NA "NaN"
> match(ex_vec,as.vector(excludes,"character"))
[1] NA NA NA NA 2 1
The character string "NaN" matches because the exclude vector has been coerced to character before the comparison.