Don't Select For NA Values - r

This is a minor nuisance in R, and I'm wondering whether there is a default setting that addresses it.
I create the following vector:
x <- c(1, 2, 1, NA)
I now want to select from x, only values equal to 1. I do as such:
x2 <- x[x == 1]
Now, when you see what's in x2 it has the values:
> x2
[1] 1 1 NA
It seems that R defaults to include NA values regardless of the condition. I would like it so that R by default excludes NA values from conditions (as it is true that NA does not satisfy the condition x == 1).
I'm aware of the complete.cases function, used as such:
x2 <- x[complete.cases(x == 1)]
The desired output would be the result of the complete.cases method as such:
[REMOVED CAUSE I MESSED THIS UP]
Which solves my problem, but I am curious to see if there is a setting in options or something like that where I can default R to not include NAs in a boolean condition.
I would like to see if there is a way to set R so that x2 <- x[x == 1] results in the same as x2 <- x[complete.cases(x == 1)]. Currently the difference is that the non-complete.cases (normal) method allows NAs through and I would like that to not be the case.
Hey, sorry, I realized I messed up my output with complete.cases as many of you have said, I essentially want to see if I can make this:
> x <- c(1, 2, 1, NA)
> x2 <- x[x == 1 & !is.na(x)]
> x2
[1] 1 1
Work with just this: x2 <- x[x == 1]. Can I make it so that R automatically ignores NAs? I could create a function to do this, but wanted to see if there is something built into R for single conditions that ignores NAs.

Do you need which?
> x <- c(1, 2, 1, NA)
> x[which(x==1)]
[1] 1 1
To explain, which(x==1) will give you the locations in your vector x that matches the test, x==1. You use this result to subset x, giving the output.
> which(x==1)
[1] 1 3
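To see where the stray NA comes from, it can help to look at the intermediate logical vector. The sketch below (the names keep and idx are only illustrative) shows that the comparison itself yields NA, that an NA in a logical index produces an NA element, and that which() returns only the positions that are TRUE.

```r
x <- c(1, 2, 1, NA)

# The comparison propagates the missing value instead of returning FALSE.
keep <- x == 1
keep
# [1]  TRUE FALSE  TRUE    NA

# An NA in a logical index yields an NA element in the result.
with_na <- x[keep]
with_na
# [1]  1  1 NA

# which() keeps only the positions that are TRUE, so the NA is dropped.
idx <- which(keep)
x[idx]
# [1] 1 1
```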

I don't see the functional value of returning a vector which contains the elements equal to 1, because all these elements will be 1. Rather, in production you most likely would be returning a boolean vector, where TRUE means a value matches and FALSE means a value does not match.
If you want NA values to show up as being FALSE for not matching, then you can make the comparison and simply replace these NA values with false:
x <- c(1, 2, 1, NA)
x2 <- x == 1
x2[is.na(x2)] <- FALSE
> x2
[1] TRUE FALSE TRUE FALSE
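If the goal is simply a shorter spelling than x[x == 1 & !is.na(x)], a few base R alternatives treat NA in the condition as "not selected". This is a minimal sketch; keep_true is a hypothetical helper written for illustration, not a built-in function.

```r
x <- c(1, 2, 1, NA)

# subset() treats NA in the condition as FALSE.
by_subset <- subset(x, x == 1)
by_subset
# [1] 1 1

# %in% returns FALSE rather than NA for missing values.
by_in <- x[x %in% 1]
by_in
# [1] 1 1

# Hypothetical helper: keep elements where the condition is TRUE, dropping NA.
keep_true <- function(v, cond) v[!is.na(cond) & cond]
by_helper <- keep_true(x, x == 1)
by_helper
# [1] 1 1
```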

Related

How to delete from vector using ifelse condition in R

I have a vector a with values (1,2,3,4) and another vector b with values (1,1,0,1). Using the elements in b as a flag, I want to remove the vector elements from A at the same positions where 0 is found in element b.
a <- c(1,2,3,4)
b <- c(1,1,0,1)
for(i in 1:length(b))
{
if(b[i] == 0)
{
a <- a[-i]
}
}
I get the desired output
a
[1] 1 2 4
But using ifelse, I do not get the output as required.
a <- c(1,2,3,4)
b <- c(1,1,0,1)
for(i in 1:length(b))
{
a <- ifelse(b[i] == 0,a[-i],a)
}
Output:
a
[1] 1
How to use ifelse in such situations?
I think ifelse isn't the correct function here, since ifelse gives output of the same length as its test and we want to subset values here. You don't need a loop either. You can directly do
a[b != 0]
#[1] 1 2 4
data
a <- 1:4
b <- c(1, 1, 0, 1)
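To illustrate why the loop with ifelse collapsed a to a single value, the following sketch (the names first, marked and kept are only for illustration) compares a length-one test, a vectorised test, and plain logical indexing.

```r
a <- c(1, 2, 3, 4)
b <- c(1, 1, 0, 1)

# The test b[1] == 0 has length one, so ifelse() returns a single value:
# the first element of `a`. That is what overwrote `a` in the loop.
first <- ifelse(b[1] == 0, a[-1], a)
first
# [1] 1

# Vectorised over all of `b`, ifelse() still returns one value per element,
# so it can mark positions but cannot drop them.
marked <- ifelse(b == 0, NA, a)
marked
# [1]  1  2 NA  4

# Logical indexing changes the length, which is what the task needs.
kept <- a[b != 0]
kept
# [1] 1 2 4
```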
Another option could be:
a[as.logical(b)]
[1] 1 2 4
If you want to use ifelse, you can use the following code
na.omit(ifelse(b==0,NA,a))
such that
> na.omit(ifelse(b==0,NA,a))
[1] 1 2 4
attr(,"na.action")
[1] 3
attr(,"class")
[1] "omit"
We can also use double negation
a[!!b]
#[1] 1 2 4
data
a <- 1:4
b <- c(1, 1, 0, 1)

How can values be assigned to the output of is.na()?

Following is related to R language.
x1 <- c(1, 4, 3, NA, 7)
is.na(x1) <- which(x1 == 7)
I don't understand: the LHS in the last line gives a logical vector and the RHS is a value (the index where x1 == 7, which is 5 in this case). So what does it mean to assign the value 5 to a logical vector?
is.na from the docs returns:
The default method for is.na applied to an atomic vector returns a logical vector of the same length as its argument x, containing TRUE for those elements marked NA or, for numeric or complex vectors, NaN, and FALSE otherwise.
That describes is.na() when it is called on its own. On the left-hand side of an assignment, R instead calls the replacement function `is.na<-`, which treats the right-hand side as an index into the vector.
So is.na(x1) <- idx sets the elements of x1 at the positions in idx to NA; here idx comes from which(), and nothing is assigned to a logical vector.
To put it in practice:
This is the output from is.na(x1):
is.na(x1)
[1] FALSE FALSE FALSE TRUE FALSE
The corresponding output from which(x1 == 7):
which(x1 == 7)
[1] 5
Combining the two, the element at position 5 becomes NA, because which() supplied that index to the replacement form of is.na():
is.na(x1) <- which(x1 == 7)
x1
[1] 1 4 3 NA NA
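The indices may also go past the end of the vector. Starting again from the original x1, the following turns the first element into NA and extends the vector so that index 7 exists and is NA; the new element at index 6 is filled with NA as well:
x1 <- c(1, 4, 3, NA, 7)
is.na(x1) <- c(1,7)
x1
[1] NA 4 3 NA 7 NA NA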
Compare with this example from the docs:
(xx <- c(0:4))
is.na(xx) <- c(2, 4)
xx
[1] 0 NA 2 NA 4
From the above, it is clear that c(2,4) refers to positions in xx: the elements at indices 2 and 4 become NA and the rest are left unchanged.
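As a quick check of that reading, this sketch compares the replacement form with an ordinary index assignment. The copy x2 is introduced only for the comparison.

```r
x1 <- c(1, 4, 3, NA, 7)
x2 <- x1

# Replacement form: mark the positions returned by which() as missing.
is.na(x1) <- which(x1 == 7)

# Plain index assignment that does the same thing.
x2[which(x2 == 7)] <- NA

x1
# [1]  1  4  3 NA NA
identical(x1, x2)
# [1] TRUE
```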

How to choose only positive values?

I have a dataset which contains positive, negative as well as NA values. How could I select positive-only values using a script? I would also like to replace negative numbers with NA and leave NA values as they are.
You could use the which function:
sample <- c(1, 2, -7, NA, NaN)
sample[which(sample > 0)]
[1] 1 2
For negative values assign NA.
Using which:
sample[which(sample < 0)] <- NA
You could try the following command:
> x<-c(1,2,3,-5)
> x[x>0]
[1] 1 2 3
would return all the positive values.
To replace negative numbers with NA use
> x <- ifelse(x<0, NA,x)
> x
[1] 1 2 3 NA
Another way to select positive values would be to use sign
x[sign(x) == 1]
and we can combine both these in Filter
Filter(function(i) i > 0, x)
Filter(function(i) sign(i) == 1, x)
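Note that the x[x > 0] example above contains no missing values; with NA in the data, plain logical indexing would return NA elements too. A minimal sketch covering both parts of the question on data that does contain NA and NaN (the names v, positives and cleaned are illustrative):

```r
v <- c(1, 2, -7, NA, NaN)

# Positive values only; which() drops the positions where the test is NA.
positives <- v[which(v > 0)]
positives
# [1] 1 2

# Replace negatives with NA, leaving the existing NA and NaN untouched.
cleaned <- v
cleaned[which(cleaned < 0)] <- NA
cleaned
# [1]   1   2  NA  NA NaN
```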

When subsetting rows with a factor with equal (==), NA's are also included. It doesn't happen with %in%. Is it normal?

Suppose I have a factor A with 3 levels A1, A2, A3 and with NA's. Each appears in 10 cases, so there is a total of 40 cases. If I do
subset1 <- df[df$A=="A1",]
dim(subset1) # 20, i.e., 10 for A1 and 10 for NA's
summary(subset1$A) # both A1 and NA have non-zero counts
subset2 <- df[df$A %in% c("A1"),]
dim(subset2) # 10, as expected
summary(subset2$A) # only A1 has non-zero count
And it is the same whether the class of the variable used for subsetting is factor or integer. Is it just how equal (and >, <) works? So should I just stick to %in% for factors and always include !is.na when using equal? Thanks!
Yes, the return types of == and %in% are different with respect to NA because of how "%in%" is defined...
# Data...
x <- c("A",NA,"A")
# When NA is encountered NA is returned
# Philosophically correct - who knows if the
# missing value at NA is equal to "A"?!
x=="A"
#[1] TRUE NA TRUE
x[x=="A"]
#[1] "A" NA "A"
# When NA is encountered by %in%, FALSE is returned, rather than NA
x %in% "A"
#[1] TRUE FALSE TRUE
x[ x %in% "A" ]
#[1] "A" "A"
This is because (from the docs)...
%in% is an alias for match, which is defined as
"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0
If we redefine it to the standard definition of match you will see that it behaves in the same way as ==
"%in2%" <- function(x,table) match(x, table, nomatch = NA_integer_) > 0
x %in2% "A"
#[1] TRUE NA TRUE
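The same effect can be reproduced on a data frame shaped like the one in the question. The data below is made up to match the description (10 rows each of A1, A2, A3 and NA), and the column name value is an assumption.

```r
# Hypothetical data: 10 rows each of A1, A2, A3 and NA.
df <- data.frame(
  A = factor(rep(c("A1", "A2", "A3", NA), each = 10)),
  value = seq_len(40)
)

# == returns NA for the missing rows, and those come back as all-NA rows.
n_eq <- nrow(df[df$A == "A1", ])
n_eq
# [1] 20

# Three ways to keep only the rows where the test is TRUE.
n_in <- nrow(df[df$A %in% "A1", ])
n_which <- nrow(df[which(df$A == "A1"), ])
n_subset <- nrow(subset(df, A == "A1"))
c(n_in, n_which, n_subset)
# [1] 10 10 10
```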
There's a mismatch here between what you want (only the entries that match your filtering) and what R does.
The difference is that when the selection vector includes an NA, the corresponding entry yields an output, but the value is NA. The logical tests that you run yield NAs, which is where the problem occurs.
Consider these cases:
x <- 1:10
y <- x
y[4] <- NA
ix1 <- which(x < 5)
ix2 <- which(y < 5)
x[ix1]
y[ix2]
Versus:
x[x < 5]
y[y < 5]
And
y < 5
It is because of this behavior that I almost never use v[logicalCondition] and instead add an additional command to select the entries, e.g. ixSelect <- which(logicalCondition). If you want NAs, you can use which(logicalCondition | is.na(v)).
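For reference, this is roughly what the cases above print for y, the vector with the missing value. It is a small sketch that repeats the setup so it can be run on its own.

```r
x <- 1:10
y <- x
y[4] <- NA

y < 5
#  [1]  TRUE  TRUE  TRUE    NA FALSE FALSE FALSE FALSE FALSE FALSE

# Logical indexing passes the NA through.
y[y < 5]
# [1]  1  2  3 NA

# which() returns only the TRUE positions.
y[which(y < 5)]
# [1] 1 2 3

# Keep the missing entries on purpose.
y[which(y < 5 | is.na(y))]
# [1]  1  2  3 NA
```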

Strange thing happening: longer object length is not a multiple of shorter object length

x1 = c(1,2,3,4,5,6,7)
x1
[1] 1 2 3 4 5 6 7
x1[which(x1== c(5,6))]
[1] 5 6
Warning message:
In x1 == c(5, 6) :
longer object length is not a multiple of shorter object length
When I exit R and then reopen R I get this:
x1 = c(1,2,3,4,5,6,7)
x1
[1] 1 2 3 4 5 6 7
x1[which(x1== c(5,6))]
[1] 5 6
The warnings message disappears. Why?
There are a few things to note here:
You should be getting that message, for exactly the reason it gives: the longer item's length isn't a multiple of the shorter item's length. This implies that what you think you're doing probably isn't what you're actually doing. You should receive this message every time you run that code; I don't know why you wouldn't have received it one of the times you ran it.
You can index a vector using logical values so using which is unnecessary here.
What you're most likely looking for is the %in% operator. What you're currently doing is an element-by-element comparison for equality, and the shorter vector will 'recycle' itself until it's the same length as the longer vector. For example:
x1 <- c(1, 2)
x2 <- c(1, 2, 3, 4)
x1 == x2
#[1] TRUE TRUE FALSE FALSE
What this is doing is testing x1[1] against x2[1], then x1[2] against x2[2], then since there are no more elements in x1 it recycles back to the beginning and tests x1[1] against x2[3], then x1[2] against x2[4].
If instead we just wanted to find which elements of x1 are in the vector x2 then as mentioned previously the %in% operator takes care of that for us:
x1 %in% x2
#[1] TRUE TRUE
This is asking is x1[1] an element of x2? Is x1[2] an element of x2? So on and so forth...
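Applied to the vector from the question, a sketch of the difference might look like this. The second vector, x8, is an invented example chosen so that the lengths divide evenly and == gives no warning at all.

```r
x1 <- c(1, 2, 3, 4, 5, 6, 7)

# Membership test: no recycling, no warning.
by_in <- x1[x1 %in% c(5, 6)]
by_in
# [1] 5 6

# With == the result depends on how the recycled values line up.
# Here 6 is compared with 1, 3, 5, 7 and 5 with 2, 4, 6, 8, so nothing matches.
x8 <- 1:8
by_eq <- x8[x8 == c(6, 5)]
by_eq
# integer(0)

# %in% finds both values regardless of order or position.
x8[x8 %in% c(6, 5)]
# [1] 5 6
```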
