Subsetting a vector with a condition (excluding NA) - r

vector1 = c(1,2,3,NA)
condition1 = (vector1 == 2)
vector1[condition1]
vector1[condition1==TRUE]
In the above code, condition1 is "FALSE TRUE FALSE NA",
and the 3rd and 4th lines both give me the result "2 NA",
which is not what I expected.
I wanted only the elements whose value is really 2, not including NA.
Could anybody explain why R is designed to work this way,
and how I can get the result I want with a simple command?

The subset vector[NA] will always be NA because the NA value is unknown and therefore the result of the subset is also unknown. %in% returns FALSE for NA, so it can be useful here.
vector1 = c(1,2,3,NA)
condition1 = (vector1 %in% 2)
vector1[condition1]
# [1] 2
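As a rough side-by-side sketch (the which() variant is an addition here, not part of the answer above), the three ways of building the index behave differently once the vector contains an NA:

```r
vector1 <- c(1, 2, 3, NA)

# == propagates the NA, so the logical index itself contains an NA
eq_index <- vector1 == 2

# %in% never returns NA, so the NA position simply becomes FALSE
in_index <- vector1 %in% 2

# which() keeps only the positions that are TRUE and drops the NA
pos_index <- which(vector1 == 2)

by_eq  <- vector1[eq_index]   # the real match plus an NA placeholder
by_in  <- vector1[in_index]   # just the match
by_pos <- vector1[pos_index]  # just the match
```

Which form to prefer is a matter of taste; %in% is the shortest, while which() leaves the original comparison untouched.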

If you enter the following at the R console (in RStudio or anywhere else):
?`[`
you will get the following explanation:
NAs in indexing
When extracting, a numerical, logical or character NA index picks an
unknown element and so returns NA in the corresponding element of a
logical, integer, numeric, complex or character result, and NULL for a
list. (It returns 00 for a raw result.)
When replacing (that is using indexing on the lhs of an assignment) NA
does not select any element to be replaced. As there is ambiguity as
to whether an element of the rhs should be used or not, this is only
allowed if the rhs value is of length one (so the two interpretations
would have the same outcome). (The documented behaviour of S was that
an NA replacement index ‘goes nowhere’ but uses up an element of
value: Becker et al p. 359. However, that has not been true of other
implementations.)
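The difference between extracting and replacing described in that help text can be sketched as follows. The variable names are only illustrative, and the tryCatch wrapper is there just so the script keeps running after the error:

```r
x <- c(1, 2, 3, NA)
idx <- x == 2            # FALSE TRUE FALSE NA

# Extraction: the NA in the index yields an NA in the result
extracted <- x[idx]

# Replacement with a length-one value: the NA index selects nothing
y <- x
y[idx] <- 0

# Replacement with a longer value is ambiguous, so R signals an error
res <- tryCatch({
  z <- x
  z[idx] <- c(10, 20)
  "no error"
}, error = function(e) "error")
```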

Try combining the comparison with !is.na() using the logical AND operator:
vector1 = c(1,2,3,NA)
condition1<-(vector1==2 & !is.na(vector1) )
condition1
# FALSE TRUE FALSE FALSE
vector1[condition1]
# 2
The & operator returns TRUE only where both of its operands are TRUE, so the position holding NA becomes FALSE instead of NA.

identical is "The safe and reliable way to test two objects for being exactly equal. It returns TRUE in this case, FALSE in every other case." (see ?identical)
Because identical does not compare elementwise, you can use it in sapply to compare each element of vector1 to 2, i.e.:
condition1 = sapply(vector1, identical, y = 2)
which will give:
vector1[condition1]
[1] 2
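One caveat worth illustrating, as a small sketch rather than part of the original answer: identical also compares types, so the approach works for the double vector in the question but needs an integer y if the vector is stored as integers. The names dbl and int are made up for this example.

```r
dbl <- c(1, 2, 3, NA)      # double vector, as in the question
int <- c(1L, 2L, 3L, NA)   # the same values stored as integers

dbl_hits       <- sapply(dbl, identical, y = 2)   # matches the double 2
int_hits       <- sapply(int, identical, y = 2)   # no match: 2L is not identical to 2
int_hits_typed <- sapply(int, identical, y = 2L)  # matches once the types agree
```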

Related

How does subsetting with NA work?

Can someone please explain in layman's terms how indexing (subsetting) with NA works? Even though there are some answers from Google, I would like to understand it better in simple terms.
When indexing a vector (of length > 1) using a single NA, why does it yield five missing values?
> x <- 1:5
> x[NA]
[1] NA NA NA NA NA
From help("["):
When extracting, a numerical, logical or character NA index picks an
unknown element and so returns NA in the corresponding element of a
logical, integer, numeric, complex or character result, and NULL for a
list.
What does "corresponding element" mean? This can be understood if you know about recycling of vector elements. x[NA] (this is a logical NA per default) in your example is actually "interpreted" as x[c(NA, NA, NA, NA, NA)] since logical indices are recycled. So, each element of x has a corresponding NA during subsetting and thus (per the quote above) NA is returned for each element of x. In layman's language: For each element of x we don't know if we want it. Thus an unknown value is returned for each element.
As @r2evans points out: x[NA_integer_] returns only one NA because integer indices are not recycled. In layman's language: We want one value but don't know which one. Thus, one unknown value is returned.
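The recycling explanation can be checked with a short sketch; the variable names are illustrative only:

```r
x <- 1:5

na_type <- typeof(NA)                    # a bare NA is a logical value

logical_na <- x[NA]                      # logical index, recycled to length 5
explicit   <- x[rep(NA, length(x))]      # the same index written out in full
integer_na <- x[NA_integer_]             # integer index, not recycled
```

Here logical_na and explicit hold five NAs each, while integer_na holds a single NA.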

Why does any() return NA when no true values [duplicate]

This question already has answers here:
Logical operators (AND, OR) with NA, TRUE and FALSE
So we have this behaviour:
any(c(TRUE, FALSE, NA))
#> [1] TRUE
any(c(TRUE, NA))
#> [1] TRUE
any(c(FALSE, NA))
#> [1] NA
Anyone know the rationale for returning NA instead of FALSE? IMO the function should be testing for presence of non-FALSE values, which NA is not.
This behavior is explained in the values section of the help file:
The value returned is TRUE if at least one of the values in x is TRUE, and FALSE if all of the values in x are FALSE (including if there are no values). Otherwise the value is NA.
As you note, this seems to differ from the behavior of more commonly used functions such as sum and mean, which return NA whenever their vector argument contains an NA. This apparent inconsistency is cleared up by joran's answer, which refers to the documentation from ?Logic; to requote:
NA is a valid logical object. Where a component of x or y is NA, the result will be NA if the outcome is ambiguous. In other words NA & TRUE evaluates to NA, but NA & FALSE evaluates to FALSE. See the examples below.
So in the case of ambiguity, for example, the calculation of a mean where the vector contains NA, or NA | FALSE where the missing value might be TRUE, NA will be the output. Whereas in other cases such as any(c(TRUE, NA)) or TRUE | NA, the outcome is unambiguous despite the presence of a missing value. This logic may be clearer in @Floo0's answer and in some of the comments to the question.
I might be mistaken but the logic here is:
NA means unknown value. So the question
Is any value of (FALSE, NA) true?
is answered with "I don't know", aka NA, because NA could be TRUE but it is unknown at the moment you are asking.
Take the question
Is any value of (TRUE, NA) true?
This is answered with TRUE as certainly the first value is TRUE.
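The same reasoning can be laid out as a small truth-table sketch. The na.rm = TRUE call at the end is an addition not mentioned above; it drops the NA before testing, which is one way to get FALSE if unknown values should simply be ignored.

```r
or_true   <- NA | TRUE    # TRUE: the result is TRUE whatever the NA hides
or_false  <- NA | FALSE   # NA: the result depends on the unknown value
and_false <- NA & FALSE   # FALSE: the result is FALSE whatever the NA hides
and_true  <- NA & TRUE    # NA: the result depends on the unknown value

any_unknown <- any(c(FALSE, NA))                # NA, same logic as NA | FALSE
any_dropped <- any(c(FALSE, NA), na.rm = TRUE)  # FALSE once the NA is removed
```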
I would wrap the call in isTRUE; this yields the desired result:
> any(c(FALSE, NA))
[1] NA
> isTRUE(any(c(FALSE, NA)))
[1] FALSE
From the documentation:
‘isTRUE(x)’ is an abbreviation of ‘identical(TRUE, x)’, and so is
true if and only if ‘x’ is a length-one logical vector whose only
element is ‘TRUE’ and which has no attributes (not even names).

How do I count the number of pattern occurrences, if the pattern includes NA, in R?

I have a string of 0's, 1's and NA's like so:
string<-c(0,1,1,0,1,1,NA,1,1,0,1,1,NA,1,0,
0,1,0,1,1,1,NA,1,0,1,NA,1,NA,1,0,1,0,NA,1)
I'd like to count the number of times the PATTERN "1-NA-1" occurs. In this instance, I would like to get the count 5.
I've tried table(string) and tried to replicate this, but nothing seems to work. I would appreciate anyone's help!
# some ugly code, but it seems to work
sum( head(string, -2) == 1 & is.na(head(string[-1],-1))
& string[-1:-2] == 1, na.rm = TRUE)
Something like:
x <- which(is.na(string))
x <- x[!x %in% c(1,length(string))]
length(x[string[x-1] & string[x+1]])
# [1] 5
-- REASONING --
First, we check which values of string are NA with is.na(string). Then we find those indices with which and store them in x.
As @Rick mentions, if the first/last value is NA it would lead to problems in our next step. So, we make sure that those are removed (as they shouldn't count anyway).
Next, we want to find the situation where both string[x-1] and string[x+1] are 1. In other words, 1 & 1. Note that FALSE and TRUE can be evaluated as 0 and 1 respectively. So, if you type 1 == TRUE you will get TRUE. If you type 1 & 1 you will also get TRUE back. So, for a vector that contains only 0, 1 and NA, string[x-1] & string[x+1] will return TRUE when both neighbours are 1, and FALSE when either is 0 (if a neighbour is itself NA, the result can be NA rather than FALSE, which would then be counted). We basically obtain a logical vector, and subset x with that vector to get all positions in x that satisfy our search. Then we use length to determine how many there are.
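A slightly more defensive variant of the same idea, offered only as a sketch: it compares the neighbours with %in% 1, so neighbours that are NA (or any value other than 1) never count. The function name count_pattern is hypothetical.

```r
string <- c(0,1,1,0,1,1,NA,1,1,0,1,1,NA,1,0,
            0,1,0,1,1,1,NA,1,0,1,NA,1,NA,1,0,1,0,NA,1)

# Count positions where an NA sits between two 1s
count_pattern <- function(v) {
  n <- length(v)
  # interior positions only; empty when the vector has fewer than 3 elements
  mid <- seq_len(max(n - 2, 0)) + 1
  sum(is.na(v[mid]) & v[mid - 1] %in% 1 & v[mid + 1] %in% 1)
}

count <- count_pattern(string)
```

Because %in% never returns NA, no na.rm argument is needed in the sum.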

Counting number of Boolean switches in R

With the array:
my_array <- c(F,T,T,F,F,T,T,T,F,T,F)
I need a script that will tell me how many times the value went from False to True. Just by eye it's easy to see it did that 3 times. I'm only interested in it switching from False to True and NOT from True to False.
Since you care only about the times when it went from FALSE to TRUE, this is the number of times the diff of the vector is equal to 1:
sum(diff(my_array) == 1)
# [1] 3
This is in my opinion the most direct way to address your question, but note that R also has the excellent rle function that returns the run-length encoding of your vector, namely the length of each section of consecutive values within the vector. You could use rle to address your particular query by counting the number of runs (excluding the last) that take the FALSE value:
sum(head(rle(my_array)$values, -1) == FALSE)
# [1] 3
Note that both of these solutions took advantage of the fact that this is a vector with only TRUE/FALSE values. A general approach to count the number of transitions from some value A to some value B is to compare head(vector, -1) with tail(vector, -1) -- namely all but the last element of the vector against all but the first. In your case:
sum(head(my_array, -1) == FALSE & tail(my_array, -1) == TRUE)
# [1] 3
The first element of head(my_array, -1) == FALSE indicates whether the first element of my_array is FALSE, the second element is whether the second element is FALSE, and so on. Meanwhile, the first element of tail(my_array, -1) == TRUE indicates whether the second element of my_array is TRUE, the second element indicates whether the third element is TRUE, and so on. Therefore, the corresponding elements of head(my_array, -1) and tail(my_array, -1) are one apart and enable us to check conditions on pairs of elements.
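The general head/tail comparison could be wrapped in a small helper, sketched here under the assumption that NA pairs should simply not be counted. The name count_transitions is made up for this example.

```r
# Count how often `from` is immediately followed by `to` in vector v
count_transitions <- function(v, from, to) {
  if (length(v) < 2) return(0L)
  # all but the last element against all but the first, pair by pair
  sum(head(v, -1) == from & tail(v, -1) == to, na.rm = TRUE)
}

my_array <- c(FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
false_to_true <- count_transitions(my_array, FALSE, TRUE)

# The same helper works for non-logical vectors
letters_vec <- c("a", "b", "a", "b", "b", "a")
a_to_b <- count_transitions(letters_vec, "a", "b")
```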

Types and comparisons in R

I've been working with R for a month or so, and my comprehension of some subtleties is still quite superficial.
I have had an issue, which I managed to solve (details below), but I still can't explain precisely why it did not work with the first solution.
Note that the example below makes no practical sense for I have simplified it as much as possible so that the problem is quite clear.
ISSUE :
Given a data frame with 4 columns (email, first, last, company) :
> users <- data.frame(matrix(vector(), 0, 4, dimnames=list(c(), c("email", "first", "last", "company"))), stringsAsFactors=F)
> users[1,] <- c("robert#redford.com", "Robert", "Redford", "Paramount")
> users[2,] <- c("julia#roberts.com", "Erin", "B.", "Hinkley")
> users[3,] <- c("matt#damon.com", "Will", "H.", "Stanford")
> users[4,] <- c("john#malkovitch.com", "John", "M.", "JM")
I take one particular row :
> user <- users[3,]
When I try to subset the data frame on a criterion that should have returned the previously mentioned row, it returns no result.
> users[users$email == user["email"],]
[1] email first last company
<0 rows> (or 0-length row.names)
I instantly thought it was a casting issue (sorry for this bad one)
> users[users$email == as.character(user["email"]),]
email first last company
3 matt#damon.com Will H. Stanford
However, when I tried to figure out where exactly the issue was, and tried this :
> users[users$email == "matt#damon.com",]
email first last company
3 matt#damon.com Will H. Stanford
> user["email"] == "matt#damon.com"
email
3 TRUE
> users[3,]$email == user$email
[1] TRUE
I got quite confused :
First, I thought about it as a math problem: if A == B and B == C, then A == C (according to Captain Obvious). So replacing a member A with another member B that is supposed to be equal to A (given the "TRUE" result) in some expression should have no impact on the result of that expression.
But "3 TRUE" is not the same thing as "[1] TRUE". I think [1] TRUE is a logical vector of length 1 whose first element is TRUE, while 3 TRUE is a row of a (1x1) matrix whose "email" column is TRUE.
My problem is with consistency: either two objects of equal content but different types should be equal, or they should be different. I have a problem with "sometimes there is type inference, and sometimes not". Is there a rule behind this behavior that I can't see? (I guess there is one.)
Another expression of the behavior I'd like to get is this one :
> unique(users$email) == "matt#damon.com"
[1] FALSE FALSE TRUE FALSE
> unique(users$email) == user["email"]
email
3 FALSE
Obviously R does get what I want (considering the fact that it gives me the matching row). But I can't explain (nor use) the result of the second statement.
Any explanations / thoughts?
In ordinary list situations, use double brackets to extract the element itself:
users$email == user[["email"]]
However, in data.frames things get inconsistent, and a lot worse!
tdf=data.frame(matrix(1:100,10,10))
tdf[] # returns data.frame everything
tdf[1] # returns data.frame first column
tdf[1,1] # returns a single value, with the type of that column
tdf[,1] # returns a vector of the first column
tdf[1,] # returns a data.frame of the first row (odd, given the line above)
tdf[2:4] # returns a data.frame with 3 columns
tdf[1,2:4] # returns a data.frame of the first row of 3 columns
tdf[2:4,2:4] # returns a 3x3 data.frame
tdf[2:4,1] # returns a vector: rows 2 to 4 of the 1st column
tdf[,2:4] # returns a data.frame with 3 columns
Then there is also the double bracket [[ ]].
Do note that with data.frames things get quite annoying and ugly:
tdf[[1]] # gives the first column as a vector
tdf[[1,1]] # gives the first element
Pretty much all other combinations give errors,
and assigning to a data.frame or matrix is an even bigger mess!
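To tie this back to the question, here is a minimal sketch of the difference between single and double brackets on a one-row data frame. The data frame is rebuilt directly with only two of the original columns, and the addresses are kept exactly as they appear in the question. The drop = FALSE line is an extra illustration, not something the answer above mentions.

```r
users <- data.frame(
  email = c("robert#redford.com", "julia#roberts.com",
            "matt#damon.com", "john#malkovitch.com"),
  first = c("Robert", "Erin", "Will", "John"),
  stringsAsFactors = FALSE
)
user <- users[3, ]

single <- user["email"]     # still a data.frame with one column
double <- user[["email"]]   # a plain character vector
dollar <- user$email        # same as the double-bracket form

# Comparing against the plain vector selects the expected row
match_rows <- users[users$email == double, ]

# drop = FALSE keeps the data.frame structure where [ , 1] would simplify
first_col_df  <- users[, 1, drop = FALSE]
first_col_vec <- users[, 1]
```

Since single is a data.frame rather than a character vector, comparing users$email against it is not the plain vector comparison the question expected, whereas double and dollar behave like the literal string.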
