I discovered a strange behavior in R and wanted to ask whether it can lead to problems.
Here is an example of what I mean:
x<-c(1,2,3,4,NA, 5)
x==1
x
this returns:
TRUE FALSE FALSE FALSE NA FALSE
Now say I want to change all x==1 to 100:
x[x==1]<-100
x
returns:
100 2 3 4 NA 5
So for some reason referencing x with a logical vector with an NA worked. Why?
To me this behavior of "==" with NAs doesn't make much sense.
I am also worried now that this kind of assignment x[x==y]<-... may have caused (unnoticed) problems with other types of vectors such as in data.frames or matrices. Has anyone encountered any problems with this or does this logical NA referencing work consistently with other types of vectors?
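A minimal sketch of what is going on, using the vector from the question (the names y and idx are only for illustration). Extraction and assignment treat an NA in a logical index differently: extraction returns an NA element, while assignment with a length-one replacement simply skips that position. As far as I know, assigning a longer replacement through an index that contains NA raises an error, so wrapping the comparison in which() or using %in% is the safer habit.

```r
x <- c(1, 2, 3, 4, NA, 5)

# Extraction: the NA in the logical index produces an NA element.
x[x == 1]

# which() keeps only the positions that are TRUE, so the NA is dropped.
x[which(x == 1)]

# %in% never returns NA, so it is another NA-safe way to build the index.
idx <- x %in% 1
idx

# Assignment with a single replacement value: the NA position is left alone.
y <- x
y[y == 1] <- 100
y
```

The same applies to a data frame column, since it is indexed as an ordinary vector.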
This question already has answers here:
Logical operators (AND, OR) with NA, TRUE and FALSE
(2 answers)
Closed 4 years ago.
Why is this so in R?
> F & NA
[1] FALSE
> T & NA
[1] NA
I would expect the first line of the code to evaluate to NA as well. People have told me 'this is simply strange behavior of R', but is there some other notion to it?
If you have an AND (&) statement and one of the values is false, then it doesn't matter what the other value is, the answer is going to be false. The NA value means that a value is missing, but the unobserved value must be a true or false and either way you're going to get false back.
But if one of the values is true, then the AND will only be true if the second value is also true. In this case, however, the missing value (NA) could be true or false, so it's impossible to say what the expression will evaluate to. Thus R has to propagate the NA value.
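To see the whole pattern at once, here is a small sketch that builds the three-valued truth tables for & and |. The names vals, lab, and_table and or_table are just for illustration.

```r
vals <- c(TRUE, FALSE, NA)
lab  <- c("TRUE", "FALSE", "NA")

# outer() applies the operator to every pair of values.
and_table <- outer(vals, vals, "&")
or_table  <- outer(vals, vals, "|")

# Label rows (left operand) and columns (right operand).
dimnames(and_table) <- list(lab, lab)
dimnames(or_table)  <- list(lab, lab)

and_table
or_table
```

The result is NA only in the cells where knowing the missing value would change the answer.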
Since all values other than 0 are taken as true in R, isTRUE(3) should logically evaluate to TRUE, but it doesn't. Why so?
Also, I would like to know the reason behind isTRUE(NA) being evaluated to false.
Straight from the documentation (try ?isTRUE)
isTRUE(x) is an abbreviation of identical(TRUE, x), and so is true if and only if x is a length-one logical vector whose only element is TRUE and which has no attributes (not even names).
It's not just doing a check on value, it's doing a check to ensure it is a logical value.
I know in computer science often 0 is false and anything non-zero is true, but R approaches things from a statistics point of view, not a computer science point of view, so it's a bit stricter about the definition.
That said, you'll notice this if statement evaluates the way you would imagine:
if(3){print("yay")}else{print("boo")}
That's just the way R goes about evaluation. The function isTRUE is simply more specific.
Also note these behaviours
FALSE == 0 is true
TRUE == 1 is true
TRUE == 3 is false
So R will treat 0 and 1 as false and true respectively when performing these sorts of evaluations.
I'm not sure what your planned implementation was (if there was any), but it's probably better to be precise about boolean logic in R, or to test things beforehand.
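A short sketch of why these comparisons behave differently from if(3): in a comparison the logical value is converted to a number (TRUE becomes 1), whereas if() and as.logical() convert the number to a logical (any non-zero value becomes TRUE). The variable names are only for illustration.

```r
# Comparison: TRUE is coerced to 1 before comparing.
cmp_one   <- TRUE == 1   # TRUE
cmp_three <- TRUE == 3   # FALSE, because 1 is not 3

# Conversion the other way round: non-zero numbers become TRUE.
log_three <- as.logical(3)   # TRUE
log_zero  <- as.logical(0)   # FALSE

# if() uses the number-to-logical direction.
branch <- if (3) "yay" else "boo"   # "yay"
```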
As for NA, more strange behaviour.
TRUE & NA equates to NA
TRUE | NA equates to TRUE
NA is already a logical value here. Since anything OR'd with TRUE is TRUE, R can return TRUE without knowing the missing value. In an AND operation, however, the result depends on the missing second term, so R returns NA.
As for your particular case, isTRUE(NA) evaluates to FALSE because NA is not a length-one logical vector whose only element is TRUE.
This is because isTRUE bypasses R's automatic conversion rules and checks that x is literally TRUE.
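For illustration, a few calls that show how strict isTRUE is, plus one place where that strictness is handy: guarding an if() against a possibly missing condition. The variable flag is hypothetical.

```r
isTRUE(TRUE)             # TRUE
isTRUE(3)                # FALSE: a number, not a logical
isTRUE(NA)               # FALSE: logical, but not TRUE
isTRUE(c(TRUE, TRUE))    # FALSE: not length one
isTRUE(as.logical(3))    # TRUE: converted explicitly first

# if (NA) would stop with an error; isTRUE() turns the NA into a plain FALSE.
flag <- NA
answer <- if (isTRUE(flag)) "yes" else "no"
answer                   # "no"
```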
So we have this behaviour:
any(c(TRUE, FALSE, NA))
#> [1] TRUE
any(c(TRUE, NA))
#> [1] TRUE
any(c(FALSE, NA))
#> [1] NA
Anyone know the rationale for returning NA instead of FALSE? IMO the function should be testing for presence of non-FALSE values, which NA is not.
This behavior is explained in the values section of the help file:
The value returned is TRUE if at least one of the values in x is TRUE, and FALSE if all of the values in x are FALSE (including if there are no values). Otherwise the value is NA.
As you note, this seems to differ from the behavior of more commonly used functions such as sum and mean, which return NA whenever their vector argument contains NA values. This problem in perception is cleared up by joran's answer, which refers to the documentation from ?Logic. To requote:
NA is a valid logical object. Where a component of x or y is NA, the result will be NA if the outcome is ambiguous. In other words NA & TRUE evaluates to NA, but NA & FALSE evaluates to FALSE. See the examples below.
So in the case of ambiguity, for example, the calculation of a mean where the vector contains NA, or NA | FALSE where the missing value might be TRUE, NA will be the output. Whereas in other cases such as any(c(TRUE, NA)) or TRUE | NA, the outcome is unambiguous despite the presence of a missing value. This logic may be clearer in @Floo0's answer and in some of the comments to the question.
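A small sketch of the ambiguity rule side by side; all() mirrors any() with the roles of TRUE and FALSE swapped. The variable names are only for illustration.

```r
s  <- sum(c(1, NA))       # NA: the total depends on the missing value
a1 <- any(c(TRUE, NA))    # TRUE: already decided by the first element
a2 <- any(c(FALSE, NA))   # NA: the missing value could still be TRUE
l1 <- all(c(FALSE, NA))   # FALSE: already decided by the first element
l2 <- all(c(TRUE, NA))    # NA: the missing value could still be FALSE
```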
I might be mistaken but the logic here is:
NA means unknown value. So the question
Is any value of (FALSE, NA) true?
Is answered with "I don't know", aka NA, because NA could be TRUE but it is unknown at the moment you are asking.
Take the question
Is any value of (TRUE, NA) true?
This is answered with TRUE as certainly the first value is TRUE.
I would wrap the call in isTRUE, this yields the desired result:
> any(c(FALSE, NA))
[1] NA
> isTRUE(any(c(FALSE, NA)))
[1] FALSE
From the documentation:
‘isTRUE(x)’ is an abbreviation of ‘identical(TRUE, x)’, and so is
true if and only if ‘x’ is a length-one logical vector whose only
element is ‘TRUE’ and which has no attributes (not even names).
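If the goal is to ignore missing values rather than to collapse an unknown result to FALSE, any() also accepts na.rm = TRUE. A brief sketch comparing the two approaches (flags is a made-up name):

```r
flags <- c(FALSE, NA)

r_plain  <- any(flags)                  # NA
r_narm   <- any(flags, na.rm = TRUE)    # FALSE: NA removed before testing
r_istrue <- isTRUE(any(flags))          # FALSE: NA result collapsed to FALSE

# With only missing values, nothing is left after removal, so the result is FALSE.
r_empty <- any(c(NA, NA), na.rm = TRUE)
```

Both give FALSE here. The difference is one of intent: na.rm = TRUE drops the missing values before testing, while isTRUE() accepts the uncertainty and treats it as "not TRUE".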
So I've been playing around with a data frame in R, although I'm still thinking too much in Python and cannot seem to find a solution for my problem.
I have a data frame and one of the columns is a user id. I would like to remove the first occurrence of each number, for instance:
1,2,3,4,3,4,2,1,3,4,6,7,7
I would like to have an output like this:
3,4,2,1,3,4,7
Where the first time the user_id appears I would remove it but keep all the others even if repeated.
With Python I would probably use enumerate or loop over it. For R, I've seen some functions that seem cool, like rle, but I'm not sure how to use them with the data frame.
Any pointers will be really helpful since right now I'm a bit lost about the best approach for this problem.
Thank you all
The function duplicated() is going to be helpful here:
x <- c(1,2,3,4,3,4,2,1,3,4,6,7,7)
x[duplicated(x)]
[1] 3 4 2 1 3 4 7
This works because duplicated() returns a logical vector indicating whether that element is, well, duplicated:
duplicated(x)
[1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE
You then use this logical vector to subset (extract) the values you want from x. But notice that in the extraction I keep all of the duplicated values, not remove them.
To remove all of the duplicated values (not what you want, but I illustrate regardless), try the negation:
x[!duplicated(x)]
[1] 1 2 3 4 6 7
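Since the question is about a data frame, here is a minimal sketch of the same idea applied to rows. The data frame df and its columns user_id and visit are assumptions made up for the example.

```r
# Hypothetical data frame with the ids from the question.
df <- data.frame(
  user_id = c(1, 2, 3, 4, 3, 4, 2, 1, 3, 4, 6, 7, 7),
  visit   = 1:13
)

# Keep only the rows whose user_id has already appeared earlier,
# i.e. drop the first occurrence of every id.
repeats <- df[duplicated(df$user_id), ]
repeats$user_id
```

The trailing comma in the subset matters: it selects rows and keeps all columns.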
Suppose I have a vector x<-c(1,2,NA,4,5,NA).
I apply some mythological code to that vector, which results in another vector, y<-c(1,NA,3, 4,10,NA)
Now I wish to find out at which positions my two vectors differ, where I count two NAs as being the same, and one NA and a non-NA as being different (e.g. the second element of the two example vectors).
Specifically, for my example, I would like to end up with a vector holding c(2,3,5).
For my use case, I am not content with a vector of logical variables, but obviously I can easily convert (which), so I'll accept that as well.
I have some solutions like:
simplediff<-x!=y
nadiff<-is.na(x)!=is.na(y)
which(simplediff | nadiff)
but it feels like I'm reinventing the wheel here. Any better options?
How about looping and using identical?
!mapply(identical,x,y)
[1] FALSE TRUE TRUE FALSE TRUE FALSE
And for positions:
seq_along(x)[!mapply(identical,x,y)]
[1] 2 3 5
or
which(!mapply(identical,x,y))
[1] 2 3 5
One possible solution (surely not the best; note that it also reports positions where both vectors are NA, such as position 6 in the example):
(1:length(x))[-which((x-y)==0)]
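A fully vectorised alternative, offered as a sketch rather than a definitive answer: build a logical vector that is TRUE where the elements count as the same (both NA, or both present and equal) and negate it. The name same is just for illustration.

```r
x <- c(1, 2, NA, 4, 5, NA)
y <- c(1, NA, 3, 4, 10, NA)

# TRUE where both are NA, or where both are present and equal.
# FALSE & NA is FALSE, so the NA from x == y never leaks into the result.
same <- (is.na(x) & is.na(y)) | (!is.na(x) & !is.na(y) & x == y)

which(!same)
```

Unlike the mapply(identical, ...) approach, this compares values with ==, so an integer 1 and a double 1 count as equal.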