Why can't R sometimes tell the difference between NA and 0? - r

I am trying to extract the rows of a data frame where the field "var" equals 0, but I found that NA values are counted as 0 as well.
The data frame d has 81291 rows in total: 20 rows with 0 and 809 rows with NA.
> length(d$var[d$var == "0"])
[1] 829
> length(d$var[d$var == 0])
[1] 829
The 829 values above include both the 0s and the NAs.
> length(d$var[d$var == "NA"])
[1] 809
> length(d$var[d$var == NA])
[1] 81291
Why does the code above give the length of d?

x == NA is not the way to test whether the value of some variable x is NA. Use is.na() instead:
> 2 == NA
[1] NA
> is.na(2)
[1] FALSE
Similarly, use is.null() to test whether an object is the NULL object.

Here is a solution that gives the right answer:
length(which(d$var == 0))
The reason you are facing this problem is that the comparison does not return FALSE for the NA values; it returns NA. When you use that result as an index, every NA in the index produces an NA element in the output, so those elements are counted too. which() returns only the positions where the condition is TRUE, and hence you get the right answer.
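The difference between indexing with a logical vector and indexing with which() can be seen on a small vector. This is a minimal sketch; x and cond are made-up names for illustration and are not part of the original data.

```r
# Hypothetical example vector: two zeros, one NA, one other value
x <- c(7, 0, NA, 0)

cond <- x == 0           # FALSE TRUE NA TRUE
x[cond]                  # 0 NA 0: the NA in the index yields an NA element
length(x[cond])          # 3

which(cond)              # 2 4: which() keeps only the TRUE positions
length(which(cond))      # 2
sum(cond, na.rm = TRUE)  # 2: counts the TRUE values and ignores the NA
```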

One way to evaluate this is the inelegant
length(d$var[(d$var == 0) & (!is.na(d$var))])
(or slightly more compactly, sum(d$var==0 & !is.na(d$var)))
I think your code illustrates some misunderstandings you are having about R syntax. Let's make a compact, reproducible example to illustrate:
d <- data.frame(var=c(7, 0, NA, 0))
As you point out, length(d$var[d$var==0]) will return 3, because NA==0 is evaluated as NA.
When you enclose the value you're looking for in quotation marks, R evaluates it as a string. So length(d$var[d$var == "NA"]) is asking how many elements in d$var are the character string "NA". Since there are no character strings "NA" in your data set, the only elements you get back are those for which the comparison itself evaluates to NA (because NA == "NA" evaluates to NA).
In order to answer your last question, look at what d$var[d$var==NA] returns: a vector of NA of the same length as your original vector. Again, any == comparison with NA evaluates to NA. Since all of the comparisons in that expression are to NA, you'll get back a vector of NAs that is the same length as your original vector.
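Printing the intermediate logical vectors for the small data frame above makes each case visible. This sketch uses the same d as the reproducible example; the last line, which counts the missing values with is.na(), is added only for comparison.

```r
d <- data.frame(var = c(7, 0, NA, 0))

d$var == 0     # FALSE  TRUE    NA  TRUE
d$var == "NA"  # FALSE FALSE    NA FALSE (the numbers are compared as strings)
d$var == NA    # NA NA NA NA

length(d$var[d$var == NA])  # 4, the full length of the vector
sum(is.na(d$var))           # 1, the actual number of missing values
```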

Related

Understand how vector subset replacement works in R

I'd like to understand what's going on in this piece of R code I was testing. I'd like to replace part of a vector with another vector. The original and replacement values are in a data.frame. I'd like to replace all elements of the vector that match the original column with the corresponding replacement values. I have the answer to the larger question, but I'm unable to understand how it works.
Here's a simple example:
> vecA <- 1:5;
> vecB <- data.frame(orig=c(2,3), repl=c(22,33));
> vecA[vecA %in% vecB$orig] <- vecB$repl #Question-1
> vecA
[1] 1 22 33 4 5
> vecD<-data.frame(orig=c(5,7), repl=c(55,77))
> vecA[vecA %in% vecD$orig] <- vecD$repl #Question-2
Warning message:
In vecA[vecA %in% vecD$orig] <- vecD$repl :
number of items to replace is not a multiple of replacement length
> vecA
[1] 1 22 33 4 55
Here are my questions:
How does the assignment on Line-3 work? The LHS expression is a 2-item vector, whereas the RHS is a 5-element vector.
Why does the assignment on Line-6 give a warning (but still work)?
The First Question
R goes through each element in vecA and checks to see if it exists in vecB$orig. The %in% operator will return a boolean. If you run the command vecA %in% vecB$orig you get the following:
[1] FALSE TRUE TRUE FALSE FALSE
which is telling you that in the vector 1 2 3 4 5 it sees 2 and 3 in vecB$orig.
By subsetting vecA by this command, you are isolating only the TRUE values in vecA, so vecA[vecA %in% vecB$orig] returns:
[1] 2 3
The assignment then writes vecB$repl into the positions of vecA where vecA %in% vecB$orig is TRUE, which replaces 2 3 in vecA with 22 33.
The Second Question
In this case, the same logic applies for subsetting, but running vecA[vecA %in% vecD$orig] gives you
[1] 5
as 7 does not exist in vecA. You are trying to replace a vector of length 1 with a vector of length 2, which is what triggers the warning. In this case, R uses only the first element of vecD$repl, which happens to be 55, and discards the rest.
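One way to avoid relying on the order and length of the replacement vector is to look up each element explicitly with match(). This is a sketch of an alternative, not the approach used in the question; idx and hit are hypothetical helper names.

```r
vecA <- c(1, 22, 33, 4, 5)
vecD <- data.frame(orig = c(5, 7), repl = c(55, 77))

# Position of each vecA element in vecD$orig (NA where there is no match)
idx <- match(vecA, vecD$orig)  # NA NA NA NA 1
hit <- !is.na(idx)

# Replace only the matched elements, each with its own replacement value
vecA[hit] <- vecD$repl[idx[hit]]
vecA                           # 1 22 33 4 55
```

Because both sides of the assignment have the same length, no recycling takes place and no warning is raised.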

Check if value is in data frame

I'm trying to check if a specific value is anywhere in a data frame.
I know the %in% operator should allow me to do this, but it doesn't seem to work the way I would expect when applying to a whole data frame:
A = data.frame(B=c(1,2,3,4), C=c(5,6,7,8))
1 %in% A
[1] FALSE
But if I apply this to the specific column the value is in it works the way I expect:
1 %in% A$B
[1] TRUE
What is the proper way of checking if a value is anywhere in a data frame?
You could do:
any(A==1)
#[1] TRUE
OR with Reduce:
Reduce("|", A==1)
OR
length(which(A==1))>0
OR
is.element(1,unlist(A))
To find the location of that value you can do, for example:
which(A == 1, arr.ind=TRUE)
# row col
#[1,] 1 1
Or simply
sum(A == 1) > 0
#[1] TRUE
Loop through the variables with sapply, then use any.
any(sapply(A, function(x) 1 %in% x))
[1] TRUE
or following digEmAll's comment, you could use unlist, which takes a list (data.frame) and returns a vector.
1 %in% unlist(A)
[1] TRUE
The trick to understanding why your first attempt doesn't work comes down to understanding what a data frame is, namely a list of vectors of equal length. What you want to do here is not check whether that list of vectors matches your condition, but check whether the values in those vectors match the condition.
Try:
any(A == 1)
Returns FALSE or TRUE
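The list structure of a data frame can be inspected directly, which shows why %in% behaves differently on the whole data frame than on a flattened vector. This is a minimal sketch using the data frame A from the question.

```r
A <- data.frame(B = c(1, 2, 3, 4), C = c(5, 6, 7, 8))

is.list(A)        # TRUE: a data frame is a list of columns
length(A)         # 2: two list elements, B and C

1 %in% A          # FALSE: 1 is matched against whole columns, not their values
1 %in% unlist(A)  # TRUE: unlist() flattens the columns into one vector
any(A == 1)       # TRUE: A == 1 is a logical matrix of element-wise tests
```

If the data frame may contain NA values, any(A == 1, na.rm = TRUE) avoids an NA result when no element matches.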

Using which(), !is.na() and parameter like [1,]

Can someone describe exactly (I understand partially) what the following line does?
which(!is.na(table[1,]))
1) table[1,]: is this row 1 or column 1 of an object (or file) called "table"?
2) !is.na: why the !? is.na is used to find the NA values, but why the !? Normally, ! represents negation (not).
If we split the expression into pieces,
table[1,]
subsets the first row of the dataset, and
is.na(table[1,])
checks whether there are NA values in that row. It returns a vector of logical elements (TRUE for NA and FALSE for non-NA).
! is the negation operator. It converts TRUE to FALSE and vice versa, so
!is.na(table[1,])
is TRUE for the non-NA elements. Lastly, the which() wrapper gives the numeric indices of the TRUE values.
To demonstrate an example, say we have a matrix
m1 <- matrix(c(NA, 0, 1, 2), 2, 2)
Then, if we follow the steps
m1[1,] #returns the 1st row as a vector
#[1] NA 1
is.na(m1[1,]) #returns TRUE for NA
#[1] TRUE FALSE
!is.na(m1[1,]) #returns TRUE for non-NA elements
#[1] FALSE TRUE
which(!is.na(m1[1,]))
#[1] 2
#or perhaps more usefully
which(is.na(m1[1,]))
#[1] 1
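When the matrix has column names, which() keeps them, so the result identifies the non-NA columns by name as well as by position. The matrix m2 below is a hypothetical example for illustration only.

```r
# Hypothetical matrix with column names a, b, c
m2 <- matrix(c(NA, 0, 1, 2, 3, NA), nrow = 2,
             dimnames = list(NULL, c("a", "b", "c")))

m2[1, ]                        # a = NA, b = 1, c = 3
which(!is.na(m2[1, ]))         # b = 2, c = 3 (names kept, values are positions)
names(which(!is.na(m2[1, ])))  # "b" "c"
```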

How to distinguish between an element and a vector of length 1 in R?

Is there a way to distinguish between 1 and c(1)? Apparently in R
c(1) == 1 # TRUE
as.matrix(c(1)) == 1 # TRUE
as.array(c(1)) == 1 # TRUE
which is a problem, for example, if I'm converting a vector to JSON:
library(rjson)
toJSON(c(1,2)) # "[1,2]"
toJSON(c(1)) # "1" instead of "[1]"
Any ideas?
It works as expected if you pass a list:
> toJSON(list(1))
[1] "[1]"
You can convert with as.list:
> toJSON(as.list(c(1)))
[1] "[1]"
> toJSON(as.list(c(1, 2)))
[1] "[1,2]"
As noted in the other answers, there is no distinction between an atomic value and a vector of length one in R -- unlike with lists, which always have a length and can contain arbitrary objects, not necessarily of the same type.
No. As far as I know, there's no difference between 1 and c(1)
> identical(1, c(1))
[1] TRUE
The reason rjson::toJSON returns a different value for c(1) than it does for c(1,2) is that it checks the length and handles length-1 objects differently.
In R numbers are just vectors of one entry. There is no distinction.
In fact, a one-element vector is automatically printed as if it were just a scalar:
a<- 1
str(a) # num 1
b<-c(1)
str(b) # num 1
If your output should encode them differently, then you will have to do that manually. You can do so because your program generates both, so it knows which are "real" vectors and which are vectors that have 1 element but are conceptually scalar.
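Encoding the distinction manually could look like the sketch below, where the caller states whether a value is conceptually an array. to_json_numeric and its as_array argument are hypothetical names, not part of rjson, and the helper only handles plain finite numeric vectors.

```r
# Hypothetical helper: the caller decides whether the value is an array.
# By default, only length-1 input is written as a bare scalar.
to_json_numeric <- function(x, as_array = length(x) != 1) {
  body <- paste(x, collapse = ",")
  if (as_array) paste0("[", body, "]") else body
}

identical(1, c(1))                   # TRUE: R itself sees no difference
to_json_numeric(c(1, 2))             # "[1,2]"
to_json_numeric(1)                   # "1"
to_json_numeric(1, as_array = TRUE)  # "[1]"
```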

How to check if entire vector has no values other than NA (or NAN) in R?

How can I check whether an entire vector contains no values other than NA (or NaN) in R?
If I use is.na, it returns a vector of TRUE / FALSE values.
I need to check whether there is at least one non-NA element.
The function all(), when passed a logical vector, will tell you whether all of the values in it are TRUE:
> all(is.na(c(NA, NaN)))
[1] TRUE
> all(is.na(c(NA, NaN, 1)))
[1] FALSE
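The check can be wrapped in a small function, which also makes the empty-vector case easy to see. all_missing is a hypothetical name used here for illustration.

```r
# Hypothetical wrapper: TRUE when every element is NA or NaN
all_missing <- function(x) all(is.na(x))

all_missing(c(NA, NaN))     # TRUE
all_missing(c(NA, NaN, 1))  # FALSE
all_missing(numeric(0))     # TRUE: all() of an empty vector is TRUE

# The opposite question: is there at least one non-NA element?
any(!is.na(c(NA, 1)))       # TRUE
```

If an empty vector should not count as "all missing", add a length(x) > 0 check to the wrapper.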

Resources