Different results for 2 subset data methods in R

I'm subsetting my data and I'm getting different results from the following two pieces of code:
subset(df, x == 1)
df[df$x == 1, ]
x's type is integer.
Am I doing something wrong?
Thank you in advance

Without example data, it is difficult to say what your problem is. However, my hunch is that the following probably explains your problem:
df <- data.frame(quantity=c(1:3, NA), item=c("Coffee", "Americano", "Espresso", "Decaf"))
df
  quantity      item
1        1    Coffee
2        2 Americano
3        3  Espresso
4       NA     Decaf
Let's subset with [:
df[df$quantity == 2,]
   quantity      item
2         2 Americano
NA       NA      <NA>
Now let's subset with subset:
subset(df, quantity == 2)
  quantity      item
2        2 Americano
We see that the subsetting output differs depending on how NA values are treated. I think of it as follows: with subset, you are explicitly stating that you want the subset for which the condition is verifiably true. df$quantity == 2 produces a vector of TRUE/FALSE values, but where quantity is missing, it is impossible to assign TRUE or FALSE. This is why we get the following output, with an NA at the end:
df$quantity==2
[1] FALSE  TRUE FALSE    NA
The function [ takes this vector but does not understand what to do with NA, which is why instead of NA Decaf we get NA <NA>. If you prefer using [, you could use the following instead:
df[which(df$quantity == 2),]
  quantity      item
2        2 Americano
This translates the logical condition df$quantity == 2 into a vector of row numbers where the logical condition is "verifiably" satisfied.
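To see why this works: which() drops the NA and keeps only the indices where the condition is verifiably TRUE (continuing with the df defined above):
which(df$quantity == 2)
# [1] 2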

Related

Trying to find movies without directors in a dataset in R

This is the code I'm trying to run to find the rows where director is NA:
nodir <- subset(x, director == "NA",
                select = c(titles))
Your problem is director == "NA". That compares director against the character string "NA", but missing values in R are genuine NAs, not the text "NA", and any comparison involving a missing value returns NA: because NA codes an unknown value, NA == "NA" can be neither TRUE nor FALSE. You want is.na(director).
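A minimal sketch of the fix, assuming a hypothetical data frame x with the columns titles and director mentioned in the question (the values are invented):
x <- data.frame(titles = c("Film A", "Film B", "Film C"),
                director = c("Jane Doe", NA, "John Smith"))
# is.na() returns TRUE exactly where director is missing, so subset keeps those rows:
nodir <- subset(x, is.na(director), select = c(titles))
nodir
#   titles
# 2 Film B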

Subsetting data but getting rows of NA where information should be

full_data = full_data[!(full_data$RIF == 1), ]
I want to subset my dataframe and return all rows where the RIF is not equal to 1. This statement returns a dataframe that has random NA rows where information previously existed and RIF was not 1. Could someone please explain to me why this issue is happening?
It is an issue with NA values in the data. One option is to use is.na to handle the elements that are NA explicitly, so that the logical index contains only TRUE/FALSE values; an NA in the index is what creates the NA row in the subset data:
full_data[(full_data$RIF != 1 & !is.na(full_data$RIF)) | is.na(full_data$RIF), ]
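A small sketch with made-up data (the id column and the values are invented) shows both the problem and the fix:
full_data <- data.frame(id = 1:4, RIF = c(1, 2, NA, 3))
# Plain logical indexing lets the NA through and produces a spurious all-NA row:
full_data[!(full_data$RIF == 1), ]
#    id RIF
# 2   2   2
# NA NA  NA
# 4   4   3
# Handling the NA explicitly drops the junk row while keeping the genuine rows,
# including the one whose RIF is NA:
full_data[(full_data$RIF != 1 & !is.na(full_data$RIF)) | is.na(full_data$RIF), ]
#   id RIF
# 2  2   2
# 3  3  NA
# 4  4   3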

Why do I get NA when indexing a vector (or dataframe) with a condition that does not match?

When I do indexing a vector or dataframe in R, I sometimes get an empty vector (e.g. numeric(0), integer(0), or factor(0)...), and sometimes get NA.
I guess that I get NA when the vector or dataframe I deal with contains NA.
For example,
iris_test = iris
iris_test$Sepal.Length[1] = NA
iris[iris$Sepal.Length < 0, "Sepal.Length"] # numeric(0)
iris_test[iris_test$Sepal.Length < 0, "Sepal.Length"] # NA
It's intuitive to me that I get numeric(0) when no values match my condition
(no search result --> no element in the resulting vector --> numeric(0)).
However, why do I get NA rather than numeric(0)?
Your assumption is essentially correct: you get NA values when there is an NA in the data.
The comparison yields NA values
iris_test$Sepal.Length < 0
#[1] NA FALSE FALSE FALSE.....
When you subset a vector with NA it returns NA. See for example,
iris$Sepal.Length[c(1, NA)]
#[1] 5.1 NA
This is what happens in your second case. In the first case, all the values are FALSE, so you get numeric(0):
iris$Sepal.Length[FALSE]
#numeric(0)
Adding to @Ronak's answer:
The discussion of NA in R for Data Science makes NA easy for me to understand. NA stands for Not Available, which is a representation of an unknown value. According to that book, missing values are "contagious": almost any operation involving an unknown (NA) value will also be unknown. Here are some examples:
# Is unknown greater than 0? Result is unknown (NA)
NA > 0
#NA
# Is unknown less than 0? Output is unknown (NA).
NA < 0
# NA
# Is unknown equal to unknown? Output is unknown(NA).
NA == NA
# NA
Getting back to your data: when you do iris_test$Sepal.Length[1] = NA, you are assigning the value of iris_test$Sepal.Length[1] as "unknown" (NA).
The question then becomes "Is unknown less than 0?".
The answer is unknown, and that is why your subsetting returns NA as output. The value is unknown (NA).
There is a function called is.na(), which I'm sure you're aware of, for handling missing values.
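For example, one way to get numeric(0) back in your iris_test case is to drop the NA positions with which() (a small sketch rebuilding the example above):
iris_test = iris
iris_test$Sepal.Length[1] = NA
# which() keeps only the indices where the condition is TRUE and drops the NA,
# so an impossible condition once again returns an empty vector:
iris_test[which(iris_test$Sepal.Length < 0), "Sepal.Length"]
# numeric(0)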
Hope that adds some insight to your question.

Filtering rows in R unexpectedly removes NAs when using subset or dplyr::filter

I have a dataset df and I would like to remove all rows for which variable y does not have the value a. Variable y also contains some NAs:
df <- data.frame(x=1:3, y=c('a', NA, 'c'))
I can achieve this using R's indexing syntax like this:
df[df$y!='a',]
  x    y
2 2 <NA>
3 3    c
Note this returns both the NA and the value c - which is what I want.
However, when I try the same thing using subset or dplyr::filter, the NA gets stripped out:
subset(df, y!='a')
  x y
3 3 c
dplyr::filter(df, y!='a')
  x y
1 3 c
Why do subset and dplyr::filter work like this? It seems illogical to me - an NA is not the same as a, so why strip out the NA when I specify I want all rows except those where variable y equals a?
And is there some way to change the behaviour of these functions, other than explicitly asking for NAs to get returned, i.e.
subset(df, y!='a' | is.na(y))
Thanks
Your example of the "expected" behavior doesn't actually return what you display in your question. I get:
> df[df$y != 'a',]
    x    y
NA NA <NA>
3   3    c
This is arguably more wrong than what subset and dplyr::filter return. Remember that in R, NA really is intended to mean "unknown", so df$y != 'a' returns,
> df$y != 'a'
[1] FALSE    NA  TRUE
So R is being told you definitely don't want the first row, you do want the last row, but whether you want the second row is literally "unknown". As a result, it includes a row of all NAs.
Many people dislike this behavior, but it is what it is.
subset and dplyr::filter make a different default choice which is to simply drop the NA rows, which arguably is accurate-ish.
But really, the lesson here is that if your data has NAs, that just means you need to code defensively around that at all points, either by using conditions like is.na(df$y) | df$y != 'a', or as mentioned in the other answer by using %in% which is based on match.
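For instance, the explicit is.na() version keeps the NA row without introducing the all-NA artifact (using the df from the question):
df[is.na(df$y) | df$y != 'a', ]
#   x    y
# 2 2 <NA>
# 3 3    c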
From ?base::Extract:
When extracting, a numerical, logical or character NA index picks an unknown element and so returns NA
From ?base::subset:
missing values are taken as false [...] For ordinary vectors, the result is simply x[subset & !is.na(subset)]
From ?dplyr::filter
Unlike base subsetting with [, rows where the condition evaluates to NA are dropped
One workaround is to use %in%:
subset(df, !y %in% "a")
dplyr::filter(df, !y %in% "a")
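Because %in% is built on match(), NA %in% "a" evaluates to FALSE rather than NA, so the NA row survives the condition. With the df from the question:
subset(df, !y %in% "a")
#   x    y
# 2 2 <NA>
# 3 3    c
dplyr::filter(df, !y %in% "a") returns the same two rows (with the row names reset).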

Conditional searching which omits NA values

I'm doing a conditional search of part of a dataset that has multiple NA values within each row.
Something like this (a preview)..
     time1 time2 time3 time4 slice1 slice2 slice3 slice4
pt1      1     3    NA    NA     NA      1      3      5
pt2     NA     1     3     5      5      2      2      4
I want to do some conditional searching which applies a condition (comparing whether one column within a row is larger than another) for each row. I want to find all the rows (pt's) where a variable column (e.g. time1) is smaller than the corresponding column (e.g. slice 1).
all.smaller<-subset(patientdata, time1>slice1 & time2>slice2 & time3>slice3 & time4>slice4, na.rm=TRUE, select=c(1))
When I use this code (on a larger, expanded table of this format), it only returns the rows without any NAs, where all the values are filled in. This makes sense given the use of '&'.
My question is: is there a way to run my conditional search so that it ignores the NAs, i.e. for each row it only checks time1>slice1, time2>slice2, etc. for the columns where values are provided?
Any help is appreciated. Thanks.
You can make a function that takes a boolean (possibly NA) and maps it to TRUE if it is NA and its value otherwise.
na.true <- function(x) ifelse(is.na(x), TRUE, x)
You can then replace your subset condition with:
na.true(time1 > slice1) & na.true(time2 > slice2) & na.true(time3 > slice3) & na.true(time4 > slice4)
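Putting it together, a sketch assuming the two preview rows above are stored in patientdata:
patientdata <- data.frame(
  time1 = c(1, NA), time2 = c(3, 1), time3 = c(NA, 3), time4 = c(NA, 5),
  slice1 = c(NA, 5), slice2 = c(1, 2), slice3 = c(3, 2), slice4 = c(5, 4),
  row.names = c("pt1", "pt2")
)
na.true <- function(x) ifelse(is.na(x), TRUE, x)
# A missing comparison counts as TRUE, so a row is only dropped when some
# fully observed time/slice pair fails the test:
all.smaller <- subset(patientdata,
                      na.true(time1 > slice1) & na.true(time2 > slice2) &
                      na.true(time3 > slice3) & na.true(time4 > slice4),
                      select = c(1))
all.smaller
#     time1
# pt1     1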
You could try this.
n <- 1:4
cond <- paste0('((is.na(time', n, ')|is.na(slice', n, '))|(time', n, '>slice', n, '))')
conds <- paste(cond, collapse = ' & ')
all.smaller <- subset(patientdata, eval(parse(text = conds)))
Essentially this checks whether either time or slice is NA and forces a TRUE in that case; if not, it checks whether time is greater than slice (individually for each index). It becomes clearer if you print out conds to see what it looks like.
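For reference, conds ends up being the following string, which subset() then parses and evaluates inside patientdata:
cat(conds)
# ((is.na(time1)|is.na(slice1))|(time1>slice1)) & ((is.na(time2)|is.na(slice2))|(time2>slice2)) & ((is.na(time3)|is.na(slice3))|(time3>slice3)) & ((is.na(time4)|is.na(slice4))|(time4>slice4))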
