Checking NA values and difference - r

What is the difference between sum(is.na(bollywood), T) and sum(is.na(bollywood))?
I have tried both of these but they are giving different output and I'm not sure of the reason.

The T at the end was presumably intended as the na.rm argument. It is better to spell out TRUE instead of T: TRUE is a reserved word and cannot be reassigned, while T is an ordinary variable that can be overwritten, which can lead to subtle bugs.
sum(is.na(bollywood), na.rm = TRUE)
Here there is no need for na.rm, because is.na returns only TRUE/FALSE depending on whether each element is NA, and never NA itself. sum then counts the TRUE elements (TRUE counts as 1, FALSE as 0).
Using a small reproducible example
bollywood <- c('a', 'b', NA, 'd', NA)
is.na(bollywood)
#[1] FALSE FALSE TRUE FALSE TRUE
Here there are two NAs, so sum returns 2
sum(is.na(bollywood))
#[1] 2
Now, we define an object T
T <- 5
sum(is.na(bollywood), T)
#[1] 7
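Here sum adds the NA count of 2 to the value of T, which is 5. Even if T has not been reassigned, the unnamed T is still treated as a value to add, so the result is 3 (2 + TRUE) rather than 2, which explains the different outputs.
Instead, it should be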
sum(is.na(bollywood), na.rm = TRUE)
#[1] 2
As noted above, na.rm is not needed at all here. If we check the documentation with ?sum, the usage is
sum(..., na.rm = FALSE)
i.e. ... can take multiple arguments, so is.na(bollywood) is the first argument, T (the object created) is the second, and so on. Because na.rm comes after ..., it is only set when it is named explicitly.
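To see how the positional and named forms differ, here is a minimal sketch using the same toy vector. It uses TRUE rather than T so that it does not depend on whether T has been reassigned, and the variable names are illustrative only.

```r
bollywood <- c("a", "b", NA, "d", NA)

# Only the NA count: two elements are missing
n_missing <- sum(is.na(bollywood))

# An unnamed TRUE lands in `...` and is added as 1
n_with_extra <- sum(is.na(bollywood), TRUE)

# A named na.rm is an option, not a value to add
n_with_narm <- sum(is.na(bollywood), na.rm = TRUE)

print(c(n_missing, n_with_extra, n_with_narm))
#> [1] 2 3 2
```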

Related

Why do conditions with %in% ignore missing values?

I encountered an unexpected output when I used %in% in a condition whilst recoding a categorical variable.
When an element of a vector on the left is NA, the condition evaluates as FALSE, whilst I expected it to be NA.
The expected behaviour is what the more verbose statement with two == conditions joined by | produces.
dt <- data.frame(colour = c("red", "orange", "blue", NA))
# Expected
dt$is_warm1 <- ifelse(dt$colour == "red" | dt$colour == "orange", TRUE, FALSE)
# Unexpected
dt$is_warm2 <- ifelse(dt$colour %in% c("red", "orange"), TRUE, FALSE)
dt
#> colour is_warm1 is_warm2
#> 1 red TRUE TRUE
#> 2 orange TRUE TRUE
#> 3 blue FALSE FALSE
#> 4 <NA> NA FALSE
This is quite unhelpful when recoding categorical variables because it silently fills missing values. Why does this happen, and are there any alternatives that don't involve listing all the == conditions? (Imagine that colour contains thirty possible levels).
a %in% b is just shorthand for match(a, b, nomatch = 0) > 0 (check the source code for %in% to satisfy yourself that this is the case).
You can make missing values propagate by removing the nomatch = 0 argument, but note that unmatched values such as "blue" then also become NA rather than FALSE:
match(dt$colour, c("red", "orange")) > 0
#> [1] TRUE TRUE NA NA
Which of course doesn't require the ifelse
%in% checks to see if NA is in the list. Consider these two scenarios
NA %in% 1:3
# [1] FALSE
NA %in% c(1:3, NA)
# [1] TRUE
This allows you to check whether NA is in the vector or not.
If you want to preserve NA values, you could write your own alternative
`%nain%` <- function(val, list) {
  ifelse(is.na(val), NA, val %in% list)
}
And then you can use
dt$is_warm3 <- dt$colour %nain% c("red", "orange")
Here is some info from the help documentation ?`%in%`
As the last line of the excerpt below shows, %in% never returns NA, which is why it returns FALSE and not NA. It is checking for missing values, as @MrFlick mentioned in his answer
Exactly what matches what is to some extent a matter of definition.
For all types, NA matches NA and no other value. For real and complex
values, NaN values are regarded as matching any other NaN value, but
not matching NA, where for complex x, real and imaginary parts must
match both (unless containing at least one NA).
Character strings will be compared as byte sequences if any input is
marked as "bytes", and otherwise are regarded as equal if they are in
different encodings but would agree when translated to UTF-8 (see
Encoding).
That %in% never returns NA makes it particularly useful in if
conditions.
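The last sentence of that excerpt can be illustrated with a small sketch (the variable names are made up for this example): an if condition built with == fails on a missing value, while the %in% version always yields TRUE or FALSE.

```r
colour <- NA_character_

# %in% never returns NA, so the condition is always usable in `if`
label_in <- if (colour %in% c("red", "orange")) "warm" else "not warm"

# == propagates NA, and `if (NA)` raises an error; catch it to show the difference
label_eq <- tryCatch(
  if (colour == "red" | colour == "orange") "warm" else "not warm",
  error = function(e) "error: missing value in condition"
)

print(label_in)
#> [1] "not warm"
print(label_eq)
#> [1] "error: missing value in condition"
```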

Indexing tables of logical vectors with zero counts in R

I have the following:
> v1 <- c(T, F, T, T, F)
> table(v1)
v1
FALSE TRUE
2 3
To index the TRUE column, I do this:
> table(v1)[2]
TRUE
3
However, if a logical vector contains only FALSE values, the table will only have one column and the previous strategy no longer works to retrieve the TRUE column:
> v2 <- c(F, F, F, F, F)
> table(v2)[2]
<NA>
NA
How can one consistently index the TRUE column regardless of if its count is zero? One solution is to do this:
> table(factor(v2, levels= c("FALSE", "TRUE")))[2]
TRUE
0
But this feels like cheating because it treats TRUE and FALSE as characters that become levels of a factor. For non-logical vectors, this behaviour is understandable, because there is no way of knowing what levels exist. (1) Is there a way to force table() to take into consideration the fact that logical vectors only take on two values and always present two columns for them? (2) Am I overthinking this and the last command is an acceptable and robust practice?
Convert to a factor with the levels specified so that it always has two levels. Without a TRUE value, there is no way for table to create the count of TRUE, as that information is not present. With the factor levels, it gives the TRUE count as 0
table(factor(v2, levels = c(FALSE, TRUE)))[2]
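As a sketch of the same idea, indexing the table by name rather than by position makes the lookup independent of column order. The helper name count_true is hypothetical, not part of base R.

```r
# Hypothetical helper: count TRUE values via table(), keeping a zero count
count_true <- function(x) {
  counts <- table(factor(x, levels = c(FALSE, TRUE)))
  counts[["TRUE"]]
}

v1 <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
v2 <- c(FALSE, FALSE, FALSE, FALSE, FALSE)

print(count_true(v1))
#> [1] 3
print(count_true(v2))
#> [1] 0
```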
It is not clear why the TRUE values of a logical vector need to be counted with table and then extracted by the TRUE/FALSE names. It can be done more easily with sum, as TRUE -> 1 and FALSE -> 0; negating (!) reverses this
> sum(v1)
[1] 3
> sum(!v1)
[1] 2
> sum(v2)
[1] 0
> sum(!v2)
[1] 5
Because the case of logical is so specific for the requirements, I would write a specific function:
logitable <- function(x)
{
  x <- as.logical(x)
  kNA <- sum(is.na(x))
  kT <- sum(x, na.rm=TRUE)
  kF <- length(x) - kT - kNA
  return (structure(
    c(kT, kF, kNA),
    names = c("TRUE", "FALSE", "NA")
  ))
}
Please note that the returned object is not of class "table"; let me know if it is important to you to return such an object.
Test with:
logitable(c(T,F,T,F,T))
logitable(c(T,T,T,T,T))
logitable(c(F,F,F,F,F))
logitable(c(T,F,T,F,NA))

Subsetting a logical vector with a logical vector in R

(Note: following the suggestions in the comments, I have changed the original title "Comparing the content of two vectors in R?" to "Subsetting a logical vector with a logical vector in R")
I am trying to understand the following R code snippet (by the way, the question originated while I was trying to understand this example.)
I have a vector a defined as:
a = c(FALSE, FALSE)
Then I can define b:
b <- a
I check b's content and everything looks OK:
b
#> [1] FALSE FALSE
Question
Now, what is the following code doing? Is it checking if b is equal to "not" a?
b[!a]
#> [1] FALSE FALSE
But if I try b[a] the result is different:
b[a]
#> logical(0)
I also tried a different example:
a = c(FALSE, TRUE)
b <- a
b
#> [1] FALSE TRUE
Now I try the same operations as above, but I get a different result:
b[!a]
#> [1] FALSE
b[a]
#> [1] TRUE
Created on 2021-03-23 by the reprex package (v0.3.0)
[] is used for subsetting a vector. You can subset a vector using integer index or logical values.
When you are using logical vector to subset a vector, a value in the vector is selected if it is TRUE. In your example you are subsetting a logical vector with a logical vector which might be confusing. Let's take another example :
a <- c(10, 20)
b <- c(TRUE, FALSE)
a[b]
#[1] 10
Since 1st value is TRUE and second is FALSE, the first value is selected.
Now if we invert the values, 20 would be selected because !b returns FALSE TRUE.
a[!b]
#[1] 20
Now implement this same logic in your example -
a = c(FALSE, FALSE)
b <- a
!a returns TRUE TRUE, hence both values are selected when you do b[!a], and none of the values is selected when you do b[a].
b[!a] displays those values of b that sit at the TRUE positions of !a.
!a is actually TRUE, TRUE, so it displays the first and second values of b, which are FALSE and FALSE.
For a clearer illustration, see this:
a <- 1:4
b <- c(T, T, F, T)
Now a[!b] is the same as a[c(F, F, T, F)], i.e. only the third element of a
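The equivalence between logical and integer subsetting mentioned above can be checked with a short sketch; which() converts the logical index into the positions it selects. The values in a are arbitrary.

```r
a <- c(10, 20, 30, 40)
b <- c(TRUE, TRUE, FALSE, TRUE)

by_logical  <- a[!b]         # keep positions where !b is TRUE
by_position <- a[which(!b)]  # which(!b) is 3, the same position as an integer index

print(by_logical)
#> [1] 30
print(identical(by_logical, by_position))
#> [1] TRUE

# With an all-FALSE index nothing is selected, giving a zero-length vector
print(a[c(FALSE, FALSE, FALSE, FALSE)])
#> numeric(0)
```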

How to check if 2 vectors are the same in which NA is treated as a normal value? [duplicate]

Here is a vector
a <- c(TRUE, FALSE, FALSE, NA, FALSE, TRUE, NA, FALSE, TRUE)
I'd like a simple function that returns TRUE every time there is a TRUE in "a", and FALSE every time there is a FALSE or an NA in "a".
The three following things do not work
a == TRUE
identical(TRUE, a)
isTRUE(a)
Here is a solution
a[-which(is.na(a))]
but it doesn't seem to be a straightforward and easy solution
Is there another solution ?
Here are some functions (and operators) I know:
identical()
isTRUE()
is.na()
na.rm()
&
|
!
What are the other functions (operators, tips, whatever,...) that are
useful to deal with TRUE, FALSE, NA, NaN?
What are the differences between NA and NaN?
Are there other "logical things" than TRUE, FALSE, NA and NaN?
Thanks a lot !
You don't need to wrap anything in a function - the following works
a = c(T,F,NA)
a %in% TRUE
[1] TRUE FALSE FALSE
To answer your questions in order:
1) The == operator does indeed not treat NA's as you would expect it to. A very useful function is this compareNA function from r-cookbook.com:
compareNA <- function(v1,v2) {
# This function returns TRUE wherever elements are the same, including NA's,
# and false everywhere else.
same <- (v1 == v2) | (is.na(v1) & is.na(v2))
same[is.na(same)] <- FALSE
return(same)
}
2) NA stands for "Not available", and is not the same as the general NaN ("not a number"). NA is generally used as a placeholder value that stands in for missing data; NaNs are normally generated because of a numerical issue (taking the log of -1 or similar).
3) I'm not really sure what you mean by "logical things"--many different data types, including numeric vectors, can be used as input to logical operators. You might want to try reading the R logical operators page: http://stat.ethz.ch/R-manual/R-patched/library/base/html/Logic.html.
Hope this helps!
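For point 2, here is a minimal sketch of how NA and NaN respond to the base R predicates is.na() and is.nan(); the vector x is just an illustration.

```r
x <- c(1, NA, NaN)

print(is.na(x))   # is.na() is TRUE for both NA and NaN
#> [1] FALSE  TRUE  TRUE
print(is.nan(x))  # is.nan() is TRUE only for NaN
#> [1] FALSE FALSE  TRUE

# A numerical issue such as log(-1) produces NaN (with a warning, silenced here)
y <- suppressWarnings(log(-1))
print(is.nan(y))
#> [1] TRUE
```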
So you want TRUE to remain TRUE and FALSE to remain FALSE, the only real change is that NA needs to become FALSE, so just do this change like:
a[ is.na(a) ] <- FALSE
Or you could rephrase to say it is only TRUE if it is TRUE and not missing:
a <- a & !is.na(a)
Taking Ben Bolker's suggestion above you could set your own function following the is.na() syntax
is.true <- function(x) {
!is.na(x) & x
}
a = c(T,F,F,NA,F,T,NA,F,T)
is.true(a)
[1] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
This also works for subsetting data.
b = c(1:9)
df <- as.data.frame(cbind(a,b))
df[is.true(df$a),]
a b
1 1 1
6 1 6
9 1 9
It also helps avoid accidentally including empty rows where NAs exist in the data.
df[df$a == TRUE,]
a b
1 1 1
NA NA NA
6 1 6
NA.1 NA NA
9 1 9
I like the is.element-function:
is.element(a, T)

Filtering with logical + NA values in one column

I have the following data frame:
df <- data.frame("Logical"=c("true",NA,"false","true","","false"),
"Numeric"=c(1,2,3,4,5,6))
unique(df$Logical)
length(df$Logical == TRUE)
I'm trying to figure out how many TRUE values I have in my df$Logical column. But it seems I'm missing something, as length(df$Logical == TRUE) returns the number of records in my logical column.
What am I doing wrong in this particular case? The desired result should be 2 for the TRUE values in the df$Logical column. Many thanks in advance.
We need to specify the string in lower case, as the values are 'true'/'false' and not exactly TRUE/FALSE. Also, instead of length, sum should be used. The sum gets the number of TRUE elements.
sum(df$Logical == "true")
#[1] NA
Because there is an NA element in the column, the comparison is NA at that position and so is the sum; use na.rm = TRUE
sum(df$Logical=='true', na.rm = TRUE)
#[1] 2
The length of a logical or any other vector would be the same as the original length/number of rows of the dataset.
length(df$Logical == "true")
#[1] 6
because it returns a logical vector of length 6.
df$Logical == "true"
#[1] TRUE NA FALSE TRUE FALSE FALSE
To get the counts of both true and false, we can use table
table(df$Logical)
First of all, "true" and "false" as you put them into your data frame are not Booleans but simple strings.
Moreover, length(df$Logical == TRUE) will always return 6 in this example, i.e. the number of elements in the column. This is because df$Logical == TRUE returns a sequence of TRUE or FALSE. In your case it will return
FALSE NA FALSE FALSE FALSE FALSE
because the boolean expression is never true. However, the length of this will be 6 as returned by length().
To overcome the problem you might define your data frame like this
df <- data.frame("Logical"=c(TRUE,NA,FALSE,TRUE,NA,FALSE),
"Numeric"=c(1,2,3,4,5,6))
And then you can sum up the number of TRUE
sum(df$Logical == TRUE, na.rm = T)
[1] 2
na.rm = T is important here because otherwise the sum will return NA if one or more elements are NA.
Alternatively, you can work with strings to indicate true or false (and empty strings as NA).
Then you could write
df <- data.frame("Logical"=c("true",NA,"false","true","","false"),
"Numeric"=c(1,2,3,4,5,6))
sum(df$Logical == "true", na.rm = T)
[1] 2
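Another option, sketched here as an assumption rather than something either answer proposes, is to convert the strings once with as.logical(), which accepts "true" and "false" and turns other strings such as "" into NA.

```r
raw <- c("true", NA, "false", "true", "", "false")

# as.logical() understands "true"/"false"; "" and NA both become NA
flags <- as.logical(raw)
print(flags)
#> [1]  TRUE    NA FALSE  TRUE    NA FALSE

# Count TRUE values, ignoring the missing ones
print(sum(flags, na.rm = TRUE))
#> [1] 2
```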
