How can values be assigned to the output of is.na()? - r

The following relates to the R language.
x1 <- c(1, 4, 3, NA, 7)
is.na(x1) <- which(x1 == 7)
I don't understand: the LHS in the last line gives you a vector of booleans, and the RHS is a value (the index where x1 == 7, which is 5 in this case). So what does it mean to assign a value of 5 to a boolean vector?

is.na from the docs returns:
The default method for is.na applied to an atomic vector returns a logical vector of the same length as its argument x, containing TRUE for those elements marked NA or, for numeric or complex vectors, NaN, and FALSE otherwise.
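So is.na(x1) on its own returns a logical vector. On the left-hand side of <-, however, R does not assign into that logical vector; it calls the replacement function `is.na<-`, and you are in essence saying: the elements of x1 at these indices should be NA.
The indices come from which(): every position it returns is set to NA, and the other elements are left unchanged.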
To put it in practice:
This is the output from is.na(x1):
is.na(x1)
[1] FALSE FALSE FALSE TRUE FALSE
The corresponding output from which(x1 == 7):
which(x1 == 7)
[1] 5
Combining the two, the element at position 5 becomes NA, so is.na() now returns TRUE for it:
is.na(x1) <- which(x1 == 7)
x1
[1] 1 4 3 NA NA
Indices beyond the end of the vector extend it. Starting again from the original x1, assigning c(1, 7) turns the first element into NA and appends two more NAs so that index 7 exists and is NA:
x1 <- c(1, 4, 3, NA, 7)
is.na(x1) <- c(1,7)
x1
[1] NA 4 3 NA 7 NA NA
Compare with this example from the docs:
(xx <- c(0:4))
is.na(xx) <- c(2, 4)
xx
[1] 0 NA 2 NA 4
From the above, it is clear that c(2, 4) refers to positions in xx: the elements at indices 2 and 4 become NA, and the rest are left unchanged.
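As a minimal sketch of the mechanics described above (for the default method on a plain vector; the variable names a, b and d are only for illustration), the three forms below should produce the same result:

```r
# Three ways to mark position 5 of a vector as NA.
x1 <- c(1, 4, 3, NA, 7)

a <- x1
is.na(a) <- 5            # replacement-function syntax

b <- `is.na<-`(x1, 5)    # the same call written out explicitly

d <- x1
d[5] <- NA               # plain indexed assignment

identical(a, b)
identical(a, d)
```

Both comparisons should return TRUE, which shows that nothing is ever assigned "to" the logical vector returned by is.na(); the assignment modifies x1 itself.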

Related

In R, why does subsetting a negative numeric value of length 1 result in widely different results depending on what you subset it on?

In R, I saw that if we subset a negative number by negative values, we get -1. If somehow a 1 is placed in, we get numeric(0), and if positive numbers are the indices, we get NA's. Why is this?
> V <- -1
> V[-c(3,4)]
[1] -1
> V[-c(1,3,4)]
numeric(0)
> V[c(1,3,4)]
[1] -1 NA NA
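In the second case, the index of the only element (1) is among the negative indices, so that element is removed and the result is numeric(0). In the first case, positions 3 and 4 don't exist, so there is nothing to remove and the vector is returned unchanged. In the third case, with positive indices, the third and fourth elements don't exist, so they come back as NA.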
c(1, 4, 3)[c(5, 6)] # // it is vector of length 3, so 5th and 6th doesn't exist
#[1] NA NA
c(1, 4, 3)[-c(5, 6)] # // no values in 5th and 6th to remove
#[1] 1 4 3 # // so it returns the original vector
In the OP's case
V[-1] # // returns numeric(0) as the first and only element is removed
#numeric(0)

Using which(), !is.na() and parameter like [1,]

Can someone describe exactly (I understand partially) what the following line does?
which(!is.na(table[1,]))
1) table[1,] = ? line 1 or column 1 or of a file called "table"?
2) !is.na = why the !? (is.na is used to eliminate the NA but why the !? Normally, ! represents negative (not equal).
If we split the function to pieces,
table[1,]
subsets the first row of the dataset (all columns)
is.na(table[1,])
checks whether there are NA values in the first row. It will return a vector of logical elements (TRUE for NA and FALSE for non-NA).
! is the negation operator. It converts TRUE to FALSE and vice versa, so the result is a logical vector that is TRUE for the elements that are not NA
!is.na(table[1,])
and lastly the which wrapper gives the numeric index of TRUE values
To demonstrate an example, say we have a matrix
m1 <- matrix(c(NA, 0, 1, 2), 2, 2)
Then, if we follow the steps
m1[1,] #returns the 1st row as a vector
#[1] NA 1
is.na(m1[1,]) #returns TRUE for NA
#[1] TRUE FALSE
!is.na(m1[1,]) #returns TRUE for non-NA elements
#[1] FALSE TRUE
which(!is.na(m1[1,]))
#[1] 2
#or perhaps more usefully
which(is.na(m1[1,]))
#[1] 1

When subsetting rows with a factor with equal (==), NA's are also included. It doesn't happen with %in%. Is it normal?

Suppose I have a factor A with 3 levels A1, A2, A3 and with NA's. Each appears in 10 cases, so there is a total of 40 cases. If I do
subset1 <- df[df$A=="A1",]
dim(subset1) # 20, i.e., 10 for A1 and 10 for NA's
summary(subset1$A) # both A1 and NA have non-zero counts
subset2 <- df[df$A %in% c("A1"),]
dim(subset2) # 10, as expected
summary(subset2$A) # only A1 has non-zero count
And it is the same whether the class of the variable used for subsetting is factor or integer. Is it just how equal (and >, <) works? So should I just stick to %in% for factors and always include !is.na when using equal? Thanks!
Yes, the return values of == and %in% differ with respect to NA because of how "%in%" is defined...
# Data...
x <- c("A",NA,"A")
# When NA is encountered NA is returned
# Philosophically correct - who knows if the
# missing value at NA is equal to "A"?!
x=="A"
#[1] TRUE NA TRUE
x[x=="A"]
#[1] "A" NA "A"
# When NA is encountered by %in%, FALSE is returned, rather than NA
x %in% "A"
#[1] TRUE FALSE TRUE
x[ x %in% "A" ]
#[1] "A" "A"
This is because (from the docs)...
%in% is a thin wrapper around match, which is defined as
"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0
If we redefine it using match's default nomatch value (NA_integer_), you will see that it behaves in the same way as ==
"%in2%" <- function(x,table) match(x, table, nomatch = NA_integer_) > 0
x %in2% "A"
#[1] TRUE NA TRUE
There's a mismatch here between what you want (only the entries that match your filtering) and what R does.
The difference is that when the selection vector includes an NA, the corresponding entry yields an output, but the value is NA. The logical tests that you run yield NAs, which is where the problem occurs.
Consider these cases:
x <- 1:10
y <- x
y[4] <- NA
ix1 <- which(x < 5)
ix2 <- which(y < 5)
x[ix1]
y[ix2]
Versus:
x[x < 5]
y[y < 5]
And
y < 5
It is because of this behavior that I almost never use v[logicalCondition] and instead add an additional command to select the entries, e.g. ixSelect <- which(logicalCondition). If you want NAs, you can use which(logicalCondition | is.na(v)).
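For reference, here is a short sketch of what the y cases above return, using the same x and y as in the answer:

```r
x <- 1:10
y <- x
y[4] <- NA

y < 5
# [1]  TRUE  TRUE  TRUE    NA FALSE FALSE FALSE FALSE FALSE FALSE

y[y < 5]          # the NA in the logical index yields an NA in the result
# [1]  1  2  3 NA

y[which(y < 5)]   # which() returns only the positions that are TRUE
# [1] 1 2 3
```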

summary still shows NAs after using both na.omit and complete.cases

I am a grad student using R and have been reading the other Stack Overflow answers regarding removing rows that contain NA from dataframes. I have tried both na.omit and complete.cases. When using both it shows that the rows with NA have been removed, but when I write summary(data.frame) it still includes the NAs. Are the rows with NA actually removed or am I doing this wrong?
na.omit(Perios)
summary(Perios)
Perios[complete.cases(Perios),]
summary(Perios)
The error is that you didn't assign the output of na.omit back to anything:
Perios <- na.omit(Perios)
If you know which column the NAs occur in, then you can just do
Perios[!is.na(Perios$Periostitis),]
or more generally:
Perios[!is.na(Perios$colA) & !is.na(Perios$colD) & ... ,]
Then as a general safety tip for R, throw in an na.fail to assert it worked:
na.fail(Perios) # trust, but verify! Paranoia is healthy.
is.na on its own is not the proper function. You want complete.cases, which for a data frame is the equivalent of function(x) apply(!is.na(x), 1, all), or na.omit to filter the data:
That is, you want all rows where there are no NA values.
> x <- data.frame(a=c(1,2,NA), b=c(3,NA,NA))
> x
   a  b
1  1  3
2  2 NA
3 NA NA
> x[complete.cases(x),]
  a b
1 1 3
> na.omit(x)
  a b
1 1 3
Then this is assigned back to x to save the data.
complete.cases returns a vector, one element per row of the input data frame. On the other hand, is.na returns a matrix. This is not appropriate for returning complete cases, but can return all non-NA values as a vector:
> is.na(x)
         a     b
[1,] FALSE FALSE
[2,] FALSE  TRUE
[3,]  TRUE  TRUE
> x[!is.na(x)]
[1] 1 2 3
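A small sketch of the relationship between the two functions, using the same toy data frame. The names keep and by_hand are only for illustration, and rowSums is one of several ways to do the row-wise check by hand:

```r
x <- data.frame(a = c(1, 2, NA), b = c(3, NA, NA))

keep    <- complete.cases(x)           # one logical value per row
by_hand <- rowSums(is.na(x)) == 0      # rows that contain no NA at all

all(keep == by_hand)                   # the two row filters agree

x <- x[keep, ]                         # assign back to keep the result
nrow(x)                                # only the first row remains
```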

variable in R for loop

I am running a loop in R to find indices of a vector when its elements are equal to elements of a reference vector.
As far as I know R, I need to declare the variable before the for-loop, but in this case I do not know the final length of my indices vector (see code below).
How can I create a variable that R can resize during the for loop?
extract of my code:
k <- 1
for(i in 1:length(Lid.time)){
ind <- which(Net.time==Lid.time[i])
if(length(ind)>0){
ind.Net[k] <- ind
k <- k+1
}
}
Notes about the code:
Lid.time is a vector of a different length than Net.time.
I need to find an array of indices that tells me where Net.time is equal to Lid.time. I do not know in advance how long the ind.Net vector will be, so how can I declare the vector ind.Net?
Thanks for your help
As Dason stated, match will work just fine for that specific task:
>a <- seq(2,20,2)
#[1] 2 4 6 8 10 12 14 16 18 20
>b <- c(4,14,18)
>match(b,a)
#[1] 2 7 9 # The indices!
>a %in% b #shorthand logical version of match
#[1] FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
But to answer your question of a vector of unknown length within a loop:
Vector <- c()
for(i in sample(1:100,20)) {
if(i<50) {Vector <- append(Vector, i)}
}
length(Vector) # how long did the vector end up?
It will be different every time you run it because of sample.
No need for a loop as it sounds like match does what you want.
a <- 1:10
b <- c(2, 7, 9)
match(a, b)
# [1] NA 1 NA NA NA NA 2 NA 3 NA
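Applied to the question's vectors, this might look like the sketch below. The values of Net.time and Lid.time are made up for illustration, and it assumes each Lid.time value occurs at most once in Net.time, since match only returns the first matching position:

```r
# Hypothetical stand-ins for the question's vectors.
Net.time <- c(10, 20, 30, 40, 50)
Lid.time <- c(20, 35, 50)

ind.Net <- match(Lid.time, Net.time)   # position in Net.time, NA if absent
ind.Net <- ind.Net[!is.na(ind.Net)]    # drop Lid.time values with no match
ind.Net
# [1] 2 5

which(Net.time %in% Lid.time)          # same positions, in Net.time order
# [1] 2 5
```

Neither form needs ind.Net to be declared or sized in advance.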
