Subsetting data but getting rows of NA where information should be - r

full_data = full_data[!(full_data$RIF == 1), ]
I want to subset my dataframe and return all rows where the RIF is not equal to 1. This statement returns a dataframe that has random NA rows where information previously existed and RIF was not 1. Could someone please explain to me why this issue is happening?

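The problem is most likely NA values in RIF. Where RIF is NA, the comparison full_data$RIF == 1 evaluates to NA rather than TRUE or FALSE, and an NA in a logical row index produces a row filled with NA. One option is to use is.na so that every element of the index is a definite TRUE or FALSE. The following keeps the rows where RIF is not 1 as well as the rows where RIF is NA: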
full_data[(full_data$RIF !=1 & !is.na(full_data$RIF))| is.na(full_data$RIF), ]
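To see the effect on a small scale, here is a minimal sketch with a made-up data frame (the object name toy and the column value are hypothetical; only RIF comes from the question). It shows the NA row that the original statement produces and two ways of avoiding it, depending on whether rows with a missing RIF should be kept or dropped.

```r
# Hypothetical toy data: RIF contains one missing value
toy <- data.frame(RIF = c(1, 2, NA, 3), value = c("a", "b", "c", "d"))

# Original approach: the NA in RIF becomes an NA in the index,
# which yields a row consisting entirely of NA
bad <- toy[!(toy$RIF == 1), ]

# Keep rows where RIF is not 1, including rows where RIF is missing
keep_na <- toy[is.na(toy$RIF) | toy$RIF != 1, ]

# Drop rows where RIF is 1 or missing
drop_na <- toy[!is.na(toy$RIF) & toy$RIF != 1, ]
```

Which of the two variants is appropriate depends on what a missing RIF means in the data.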

Related

extracting identifiers from row observations

I want to extract specific elements, specifically ID, from rows that have NAs. Here is my df:
df
ID x
1-12 1
1-13 NA
1-14 3
2-12 20
3-11 NA
I want a dataframe that has the IDs of observations that are NA, like so:
df
ID x
1-13 NA
3-11 NA
I tried this, but it's giving me a dataframe with the row #s that have NAs (e.g., row 2, row 5), not the IDs.
df1 <- data.frame(which(is.na(df$x)))
Can someone please help?
This is a very basic subsetting question:
df[is.na(df$x),]
Good basic and free guides can be found on w3schools: https://www.w3schools.com/r/
Cheers
Hannes
Simply run the following line:
df[is.na(df$x),]
Another option is complete.cases
subset(df, !complete.cases(x))
Here is another base R option using na.omit
> df[!1:nrow(df) %in% row.names(na.omit(df)), ]
ID x
2 1-13 NA
5 3-11 NA
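For anyone who wants to try the answers above, here is a self-contained sketch that rebuilds the data from the question. The names missing_rows and missing_ids are chosen for illustration only.

```r
# Rebuild the example data from the question
df <- data.frame(ID = c("1-12", "1-13", "1-14", "2-12", "3-11"),
                 x  = c(1, NA, 3, 20, NA))

# Rows where x is missing, keeping both columns
missing_rows <- df[is.na(df$x), ]

# Only the IDs, as a plain vector
missing_ids <- df$ID[is.na(df$x)]
```

The logical vector is.na(df$x) is used directly as the index, so there is no need to go through row numbers with which().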

Trying to find movies without directors in a ds on R

This is the code I'm trying to run to find rows where director is NA:
nodir <- subset(x, director=="NA",
select = c(titles))
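Your problem is director=="NA". It compares director with the character string "NA", not with the missing value NA. Even director == NA would not work: any comparison with NA is defined to return NA, because NA codes a missing value, so NA == NA can be neither TRUE nor FALSE. subset() drops the rows where the condition is NA, so nothing is selected. You want is.na(director).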
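A minimal sketch of the difference, using made-up data (the film titles and director names are hypothetical; x, titles and director are the names from the question):

```r
# Hypothetical data: one film has no director recorded
x <- data.frame(titles   = c("Film A", "Film B", "Film C"),
                director = c("Someone", NA, "Someone Else"))

# Comparing with the string "NA" never yields TRUE here, so nothing is selected
wrong <- subset(x, director == "NA", select = c(titles))

# is.na() tests for missing values directly
nodir <- subset(x, is.na(director), select = c(titles))
```

If the raw file stores missing directors as the literal text "NA", the string comparison would match those rows, so it is worth checking how the data was read in.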

Why do I get NA when I index a vector (or data frame) and nothing matches my condition?

When I index a vector or data frame in R, I sometimes get an empty vector (e.g. numeric(0), integer(0), or factor(0)...), and sometimes get NA.
I guess that I get NA when the vector or dataframe I deal with contains NA.
For example,
iris_test = iris
iris_test$Sepal.Length[1] = NA
iris[iris$Sepal.Length < 0, "Sepal.Length"] # numeric(0)
iris_test[iris_test$Sepal.Length < 0, "Sepal.Length"] # NA
It's intuitive for me to get numeric(0) when I find values that do not match my condition
(no search result --> no element in the resulted vector --> numeric(0)).
However, why do I get NA rather than numeric(0)?
Your assumption is correct: you get NA values when there is an NA in the data.
The comparison yields NA values
iris_test$Sepal.Length < 0
#[1] NA FALSE FALSE FALSE.....
When you subset a vector with NA it returns NA. See for example,
iris$Sepal.Length[c(1, NA)]
#[1] 5.1 NA
This is what the second case returns. In the first case, all the values are FALSE, so you get numeric(0)
iris$Sepal.Length[FALSE]
#numeric(0)
Adding to @Ronak's answer:
The discussion of NA in R for Data Science made NA easy for me to understand. NA stands for Not Available and represents an unknown value. According to the book, missing values are "contagious": almost any operation involving an unknown (NA) value will also be unknown. Here are some examples:
# Is unknown greater than 0? Result is unknown (NA)
NA > 0
#NA
# Is unknown less than 0? Output is unknown (NA).
NA < 0
# NA
# Is unknown equal to unknown? Output is unknown(NA).
NA == NA
# NA
Getting back to your data, when you do:
iris_test$Sepal.Length[1] = NA, you are assigning the value of iris_test$Sepal.Length[1] as "unknown" (NA).
The question is "Is unknown less than 0?".
The answer is unknown, and that is why your subsetting returns NA as output. The value is unknown (NA).
There is a function called is.na() which I'm sure you're aware of to handle missing values.
Hope that adds some insight to your question.
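As a sketch of how is.na() (or which()) can be applied to the example from the question so that the result is numeric(0) again; the names cond, with_na, no_match and no_match2 are introduced here for illustration:

```r
iris_test <- iris
iris_test$Sepal.Length[1] <- NA

# NA for row 1, FALSE for every other row
cond <- iris_test$Sepal.Length < 0

# An NA in the index is returned as NA
with_na <- iris_test[cond, "Sepal.Length"]

# which() keeps only the positions that are TRUE, so the NA is dropped
no_match <- iris_test[which(cond), "Sepal.Length"]

# Equivalent: turn NA into FALSE explicitly
no_match2 <- iris_test[!is.na(cond) & cond, "Sepal.Length"]
```

Both alternatives treat "unknown" as "does not match", which is an assumption about what you want, not something R can decide for you.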

Different results for 2 subset data methods in R

I'm subsetting my data, and I'm getting different results for the following code:
subset(df, x==1)
df[df$x==1,]
x's type is integer
Am I doing something wrong?
Thank you in advance
Without example data, it is difficult to say what your problem is. However, my hunch is that the following probably explains your problem:
df <- data.frame(quantity=c(1:3, NA), item=c("Coffee", "Americano", "Espresso", "Decaf"))
df
quantity item
1 Coffee
2 Americano
3 Espresso
NA Decaf
Let's subset with [
df[df$quantity == 2,]
quantity item
2 Americano
NA <NA>
Now let's subset with subset:
subset(df, quantity == 2)
quantity item
2 Americano
We see that there is a difference in sub-setting output depending on how NA values are treated. I think of this as follows: With subset, you are explicitly stating you want the subset for which the condition is verifiably true. df$quantity==2 produces a vector of true/false-statements, but where quantity is missing, it is impossible to assign TRUE or FALSE. This is why we get the following output with an NA at the end:
df$quantity==2
[1] FALSE TRUE FALSE NA
The function [ takes this vector, but it cannot tell whether the row with an NA index should be selected, so it returns a row filled with NA: instead of NA Decaf we get NA <NA>. If you prefer using [, you could use the following instead:
df[which(df$quantity == 2),]
quantity item
2 Americano
This translates the logical condition df$quantity == 2 into a vector of row numbers where the logical condition is "verifiably" satisfied.
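The variants discussed above can be compared side by side. The last line uses %in%, which is not part of the answer above; it is shown as one more option because %in% returns FALSE rather than NA for a missing value. The by_* names are made up for this sketch.

```r
df <- data.frame(quantity = c(1:3, NA),
                 item = c("Coffee", "Americano", "Espresso", "Decaf"))

by_bracket <- df[df$quantity == 2, ]         # 2 rows, the second one all NA
by_subset  <- subset(df, quantity == 2)      # 1 row
by_which   <- df[which(df$quantity == 2), ]  # 1 row
by_in      <- df[df$quantity %in% 2, ]       # 1 row; %in% gives FALSE for NA
```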

R: if a value is less or is na update another data.frame

I have two data.frames A and B.
A contains negative, positive and NA values.
B contains only positive and NA values.
The dimensions of the data frames are the same.
data.frame A looks like this:
ENSMUSG00000000001.4/Gnai3 0.1943315 0.3021675 NA NA
ENSMUSG00000000003.9/Pbsn -1.4843914 -1.2608270 -0.2587953 -0.46167430
ENSMUSG00000000028.8/Cdc45 -0.2388901 -0.1106236 0.9046436 0.08968331
ENSMUSG00000000037.9/Scml 0.3242902 0.5385371 0.2311202 0.51110287
ENSMUSG00000000049.5/Apoh -1.7606033 -1.8159545 -0.2087083 -1.09614630
ENSMUSG00000000056.7/Narf NA NA -0.3747798 -0.55547798
I need to check if a value is NA or negative in this table then I need to update data.frame B on the same indices to the value 0.999.
For example:
The first record of A has two NA values, indexes are [1,4] and [1,5] meaning, I will update B[1,4]=0.999 and B[1,5]=0.999.
I could do this in the nested loops for columns and rows but it would take too much time. Is there a faster way?
You can pass a Boolean mask as an index if it's the same size:
B[is.na(A) | A < 0] <- 0.999
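The mask approach can be tried on a small example. This sketch assumes that every column is numeric (for instance, with the gene names stored as row names rather than as a column); the column names s1 and s2 and all values are made up.

```r
# Hypothetical numeric data frames with the same dimensions
A <- data.frame(s1 = c(0.19, -1.48, NA), s2 = c(NA, -0.26, 0.90))
B <- data.frame(s1 = c(0.10, 0.20, 0.30), s2 = c(0.40, NA, 0.60))

# Logical matrix: TRUE wherever A is missing or negative
mask <- is.na(A) | A < 0

# Overwrite the matching cells of B in one vectorised step
B[mask] <- 0.999
```

Because is.na(A) is TRUE wherever A < 0 would be NA, the combined mask never contains NA, so the assignment is well defined for every cell.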
I would use ifelse to do this, since the dataframes have the same dimensions.
A<-matrix(data=1:15,nrow=5) # create matrices (convert data frames with as.matrix() first)
B<-matrix(data=16:30,nrow=5)
B[1,2]<-NA # introduce some NA and negative values
B[5,3]<-(-1)
ifelse(is.na(B) | B<=0,A,B) # new matrix with "updated" values
