Trying to find movies without directors in a dataset in R

This is the code I'm trying to run to find the rows where director is NA:
nodir <- subset(x, director == "NA",
                select = c(titles))

Your problem is director == "NA". That compares against the literal string "NA", and for rows where director is actually missing the comparison returns NA: because NA codes a missing value, NA == "NA" can be neither TRUE nor FALSE. You want is.na(director).
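A minimal sketch with made-up data (the names x, title, and director mirror the question):

```r
# Hypothetical data frame: one film has a missing director
x <- data.frame(title    = c("Film A", "Film B"),
                director = c("Someone", NA))

x$director == "NA"
# FALSE NA  -- comparing the missing value yields NA, not TRUE or FALSE

subset(x, is.na(director), select = title)
# returns the "Film B" row only
```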

Related

Subsetting data but getting rows of NA where information should be

full_data = full_data[!(full_data$RIF == 1), ]
I want to subset my dataframe and return all rows where the RIF is not equal to 1. This statement returns a dataframe that has random NA rows where information previously existed and RIF was not 1. Could someone please explain to me why this issue is happening?
It is an issue with NA values in the data. One option is to use is.na to handle the NA elements explicitly: otherwise the comparison returns NA, and it is that NA in the logical index which creates the NA rows in the subset data.
full_data[(full_data$RIF !=1 & !is.na(full_data$RIF))| is.na(full_data$RIF), ]
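A small reproduction with made-up data (RIF and full_data as in the question), showing both the spurious NA row and the guarded version:

```r
# Hypothetical data: the third row has a missing RIF
full_data <- data.frame(RIF = c(1, 2, NA), id = c("a", "b", "c"))

# Naive filter: the NA in the logical index produces an all-NA row
full_data[!(full_data$RIF == 1), ]

# Guarded filter: the row with a missing RIF is kept as a real row,
# not turned into an NA row
full_data[(full_data$RIF != 1 & !is.na(full_data$RIF)) | is.na(full_data$RIF), ]
```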

Assign a value in an R data frame without checking if the index is empty

df = data.frame(A=c(1,1),B=c(2,2))
df$C = NA
df[is.na(df$B),]$C=5
Each time I want to assign a new value and the index turns out to be empty, as with is.na(df$B) here, R raises the error: replacement has 1 row, data has 0.
Is there a way to have R simply assign nothing in these cases instead of raising an error?
We can do this in a single line instead of assigning 'C' as NA and then subsetting the data.frame. The code below assigns 5 to 'C' wherever 'B' is NA; the other elements of 'C' are left as NA:
df$C[is.na(df$B)] <- 5
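A sketch of why this form is safe, using the question's data with one case changed (hypothetically) to contain an NA in B:

```r
# Case 1: B contains an NA, so the index matches one row
df <- data.frame(A = c(1, 1), B = c(2, NA))
df$C[is.na(df$B)] <- 5        # C becomes c(NA, 5)

# Case 2: no NAs in B -- once C exists, assigning to an empty index
# is a silent no-op rather than an error
df2 <- data.frame(A = c(1, 1), B = c(2, 2))
df2$C <- NA
df2$C[is.na(df2$B)] <- 5      # assigns to zero elements, no error
```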

Different results for 2 subset data methods in R

I'm subsetting my data, and I'm getting different results from the following two approaches:
subset(df, x==1)
df[df$x==1,]
x's type is integer
Am I doing something wrong?
Thank you in advance
Without example data, it is difficult to say what your problem is. However, my hunch is that the following probably explains your problem:
df <- data.frame(quantity=c(1:3, NA), item=c("Coffee", "Americano", "Espresso", "Decaf"))
df
  quantity      item
1        1    Coffee
2        2 Americano
3        3  Espresso
4       NA     Decaf
Let's subset with [
df[df$quantity == 2,]
   quantity      item
2         2 Americano
NA       NA      <NA>
Now let's subset with subset:
subset(df, quantity == 2)
  quantity      item
2        2 Americano
We see that the subsetting output differs depending on how NA values are treated. I think of this as follows: with subset, you are explicitly stating that you want the subset for which the condition is verifiably true. df$quantity == 2 produces a vector of TRUE/FALSE values, but where quantity is missing it is impossible to assign TRUE or FALSE. This is why we get the following output, with an NA at the end:
df$quantity==2
[1] FALSE TRUE FALSE NA
The function [ takes this vector but cannot tell which row an NA index refers to, which is why instead of NA Decaf we get a row of missing values, NA <NA>. If you prefer using [, you could use the following instead:
df[which(df$quantity == 2),]
  quantity      item
2        2 Americano
This translates the logical condition df$quantity == 2 into a vector of row numbers where the logical condition is verifiably satisfied.
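The difference is easy to see on the logical vector itself (a one-line sketch):

```r
which(c(FALSE, TRUE, NA))   # 2 -- which() drops the NA instead of propagating it
```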

Filtering rows in R unexpectedly removes NAs when using subset or dplyr::filter

I have a dataset df and I would like to remove all rows for which variable y does not have the value a. Variable y also contains some NAs:
df <- data.frame(x=1:3, y=c('a', NA, 'c'))
I can achieve this using R's indexing syntax like this:
df[df$y!='a',]
x y
2 <NA>
3 c
Note this returns both the NA and the value c - which is what I want.
However, when I try the same thing using subset or dplyr::filter, the NA gets stripped out:
subset(df, y!='a')
x y
3 c
dplyr::filter(df, y!='a')
x y
3 c
Why do subset and dplyr::filter work like this? It seems illogical to me - an NA is not the same as a, so why strip out the NA when I specify I want all rows except those where variable y equals a?
And is there some way to change the behaviour of these functions, other than explicitly asking for NAs to get returned, i.e.
subset(df, y!='a' | is.na(y))
Thanks
Your example of the "expected" behavior doesn't actually return what you display in your question. I get:
> df[df$y != 'a',]
x y
NA NA <NA>
3 3 c
This is arguably more wrong than what subset and dplyr::filter return. Remember that in R, NA really is intended to mean "unknown", so df$y != 'a' returns,
> df$y != 'a'
[1] FALSE NA TRUE
So R is being told you definitely don't want the first row, you do want the last row, but whether you want the second row is literally "unknown". As a result, it includes a row of all NAs.
Many people dislike this behavior, but it is what it is.
subset and dplyr::filter make a different default choice which is to simply drop the NA rows, which arguably is accurate-ish.
But really, the lesson here is that if your data has NAs, that just means you need to code defensively around that at all points, either by using conditions like is.na(df$y) | df$y != 'a', or as mentioned in the other answer by using %in% which is based on match.
From ?base::Extract:
When extracting, a numerical, logical or character NA index picks an unknown element and so returns NA
From ?base::subset:
missing values are taken as false [...] For ordinary vectors, the result is simply x[subset & !is.na(subset)]
From ?dplyr::filter:
Unlike base subsetting with [, rows where the condition evaluates to NA are dropped
One workaround is to use %in%:
subset(df, !y %in% "a")
dplyr::filter(df, !y %in% "a")
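A quick check with the question's own data showing why %in% sidesteps the problem (the comparison itself never returns NA):

```r
df <- data.frame(x = 1:3, y = c('a', NA, 'c'))

df$y != 'a'      # FALSE NA TRUE   -- the NA survives the comparison
!df$y %in% 'a'   # FALSE TRUE TRUE -- %in% is match-based, so NA becomes FALSE

subset(df, !y %in% 'a')   # keeps both the NA row and the 'c' row
```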

Conditional searching which omits NA values

I'm doing a conditional search of part of a dataset that has multiple NA values within each row.
Something like this (a preview)..
     time1 time2 time3 time4 slice1 slice2 slice3 slice4
pt1      1     3    NA    NA     NA      1      3      5
pt2     NA     1     3     5      5      2      2      4
I want to do some conditional searching which applies a condition (comparing whether one column within a row is larger than another) for each row. I want to find all the rows (pt's) where a time column (e.g. time1) is smaller than the corresponding slice column (e.g. slice1).
all.smaller<-subset(patientdata, time1>slice1 & time2>slice2 & time3>slice3 & time4>slice4, na.rm=TRUE, select=c(1))
When I use this code (on a larger table of this format), it only returns the rows without any NAs, where all the values are filled in. This makes sense given the use of '&'.
My question is: is there a way to run my conditional search so that it ignores the NAs - that is, returns the rows where, for every pair of columns in which values are provided, time1 > slice1, time2 > slice2, and so on hold?
Any help is appreciated. Thanks.
You can make a function that takes a boolean (possibly NA) and maps it to TRUE if it is NA and its value otherwise.
na.true <- function(x) ifelse(is.na(x), TRUE, x)
You can then replace your subset with
na.true(time1 > slice1) & na.true(time2 > slice2) & na.true(time3 > slice3) & na.true(time4 > slice4)
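For instance, with a cut-down two-pair version of the preview data (values taken from pt1 and pt2 above):

```r
na.true <- function(x) ifelse(is.na(x), TRUE, x)

patientdata <- data.frame(time1 = c(1, NA), slice1 = c(NA, 5),
                          time2 = c(3, 1),  slice2 = c(1, 2))

# pt1: time1 > slice1 is NA -> TRUE, time2 > slice2 is TRUE  -> row kept
# pt2: time1 > slice1 is NA -> TRUE, time2 > slice2 is FALSE -> row dropped
subset(patientdata, na.true(time1 > slice1) & na.true(time2 > slice2))
```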
You could try this.
n=1:4
cond <- paste0('((is.na(time',n,')|is.na(slice',n,'))|(time',n,'>slice',n,'))')
conds <- paste(cond, collapse=' & ')
all.smaller <- subset( patientdata, eval(parse(text=conds)) )
Essentially this checks if either time or slice are NA and forces a TRUE, and if not, check whether time is greater than slice. (Individually for each index.) It becomes clearer if you print out conds to see what it looks like.
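For example, with n = 1:2 the generated expression string looks like this:

```r
n <- 1:2
cond <- paste0('((is.na(time', n, ')|is.na(slice', n, '))|(time', n, '>slice', n, '))')
paste(cond, collapse = ' & ')
# "((is.na(time1)|is.na(slice1))|(time1>slice1)) & ((is.na(time2)|is.na(slice2))|(time2>slice2))"
```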
