An NA in subsetting a data.frame does something unexpected

Consider the following code. If you don't explicitly test for NA in your condition, code like this will fail at some later date, when your data changes.
> # A toy example
> a <- as.data.frame(cbind(col1=c(1,2,3,4), col2=c(2,NA,2,3), col3=c(1,2,3,4), col4=c(4,3,2,1)))
> a
  col1 col2 col3 col4
1    1    2    1    4
2    2   NA    2    3
3    3    2    3    2
4    4    3    4    1
>
> # Bummer, there's an NA in my condition
> a$col2 == 2
[1]  TRUE    NA  TRUE FALSE
>
> # Why is this a good thing to do?
> # It NA'd the whole row, and kept it
> a[a$col2 == 2, ]
   col1 col2 col3 col4
1     1    2    1    4
NA   NA   NA   NA   NA
3     3    2    3    2
>
> # Yes, this is the right way to do it
> a[!is.na(a$col2) & a$col2 == 2, ]
  col1 col2 col3 col4
1    1    2    1    4
3    3    2    3    2
>
> # subset() seems designed to avoid this problem
> subset(a, col2 == 2)
  col1 col2 col3 col4
1    1    2    1    4
3    3    2    3    2
Can someone explain why the behavior you get without the is.na check would ever be good or useful?

I definitely agree that this isn't intuitive (I've made that point before on SO). In R's defense, I think that knowing when you have a missing value is useful (i.e. this is not a bug). The == operator is explicitly documented to return NA when either operand is NA or NaN. See ?"==" for more information. It states:
Missing values ('NA') and 'NaN' values are regarded as
non-comparable even to themselves, so comparisons involving them
will always result in 'NA'.
In other words, a missing value isn't comparable using a binary operator (because it's unknown).
Beyond is.na(), you could also do:
which(a$col2 == 2)  # returns only the indices where the test is TRUE
Or:
a$col2 %in% 2  # only checks for 2, and never returns NA
%in% is defined using the match() function:
"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0
This is also covered in "The R Inferno".
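Since %in% maps NA to FALSE rather than propagating it, it can be used directly as a subsetting condition. A quick sketch using the toy data frame a from above:

```r
a <- data.frame(col1 = c(1, 2, 3, 4), col2 = c(2, NA, 2, 3),
                col3 = c(1, 2, 3, 4), col4 = c(4, 3, 2, 1))

a$col2 %in% 2
# [1]  TRUE FALSE  TRUE FALSE

a[a$col2 %in% 2, ]  # the NA row is dropped, not kept as an all-NA row
```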
Checking for NA values in your data is crucial in R, because many important operators don't handle it the way you expect. Beyond ==, this is also true for things like &, |, <, sum(), and so on. I am always thinking "what would happen if there was an NA here" when I'm writing R code. Requiring an R user to be careful with missing values is "by design".
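The propagation is easy to demonstrate; a small sketch (na.rm = TRUE is the usual escape hatch for summary functions such as sum()):

```r
x <- c(1, NA, 3)

x > 2                 # [1] FALSE    NA  TRUE
sum(x)                # [1] NA -- a single NA poisons the total
sum(x, na.rm = TRUE)  # [1] 4
NA & FALSE            # [1] FALSE -- FALSE no matter what the missing value is
NA | TRUE             # [1] TRUE  -- TRUE no matter what the missing value is
NA & TRUE             # [1] NA    -- could be either, so still unknown
```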
Update: How is NA handled when there are multiple logical conditions?
NA is a logical constant and you might get unexpected subsetting if you don't think about what might be returned (e.g. NA | TRUE evaluates to TRUE). These truth tables from ?Logic, computed with x <- c(NA, FALSE, TRUE), may provide a useful illustration:
outer(x, x, "&") ## AND table
#        <NA> FALSE  TRUE
# <NA>     NA FALSE    NA
# FALSE FALSE FALSE FALSE
# TRUE     NA FALSE  TRUE
outer(x, x, "|") ## OR table
#       <NA> FALSE TRUE
# <NA>    NA    NA TRUE
# FALSE   NA FALSE TRUE
# TRUE  TRUE  TRUE TRUE

Related

Removing NAs from a dataset in R

I want to remove all of the NAs from the variables selected; however, when I use na.omit(), for example:
na.omit(df$livharm)
it does not work and the NAs are still there. I have also tried an alternative way, for example:
married[is.na(livharm1)] <- NA
I have done this for each variable within the larger variable I am looking at, using code like:
df <- within(df, {
  married <- as.numeric(livharm == 1)
  # ... (repeated for each variable)
  married[is.na(livharm1)] <- NA
})
however I'm not sure what I actually have to do. Any help I would greatly appreciate!
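One likely explanation (an assumption about what is going wrong, using a hypothetical livharm column): na.omit() does not modify its input in place; it returns a new object that must be assigned back. A minimal sketch:

```r
df <- data.frame(livharm = c(1, NA, 2, NA, 1))

na.omit(df$livharm)  # prints a cleaned copy, but df itself still contains NAs
df2 <- na.omit(df)   # assign the result to keep the NA-free rows
keep <- df$livharm[!is.na(df$livharm)]  # or index the single vector directly
```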
Using complete.cases gives:
dat <- data.frame( a=c(1,2,3,4,5),b=c(1,NA,3,4,5) )
dat
a b
1 1 1
2 2 NA
3 3 3
4 4 4
5 5 5
complete.cases(dat)
[1] TRUE FALSE TRUE TRUE TRUE
# is.na equivalent has to be used on a vector for the same result:
!is.na(dat$b)
[1] TRUE FALSE TRUE TRUE TRUE
dat[complete.cases(dat),]
a b
1 1 1
3 3 3
4 4 4
5 5 5
Using na.omit is similar to complete.cases, but instead of returning a logical vector it returns the filtered object itself.
na.omit(dat)
a b
1 1 1
3 3 3
4 4 4
5 5 5
This function returns a different result when applied only to a vector, which is probably not handled correctly by ggplot2. It can be "rescued" by putting it back in a data frame. base plot works as intended though.
na.omit(dat$b)
[1] 1 3 4 5
attr(,"na.action")
[1] 2
attr(,"class")
[1] "omit"
data.frame(b=na.omit(dat$b))
b
1 1
2 3
3 4
4 5
Plotting with ggplot2
ggplot(dat[complete.cases(dat),]) + geom_point( aes(a,b) )
# <plot>
# See warning when using original data set with NAs
ggplot(dat) + geom_point( aes(a,b) )
Warning message:
Removed 1 rows containing missing values (geom_point).
# <same plot as above>

How to add dummy variables to data with specific characteristic

My question is probably quite basic but I've been struggling with it so I'd be really grateful if someone could offer a solution.
I have data in the following format:
ORG_NAME  var_1_12  var_1_13  var_1_14
A         12        11        5
B         13        13        11
C         6         7         NA
D         NA        NA        5
I have data on organizations over 5 years, but over that time, some organizations have merged and others have disappeared. I'm planning on conducting a fixed-effects regression, so I need to add a dummy variable which is "0" when organizations have remained the same (in this case row A and row B), and "1" in the year before the merge, and after the merge. In this case, I know that orgs C and D merged, so I would like for the data to look like this:
ORG_NAME  var_1_12  dum_12  var_1_13  dum_13
A         12        0       5         0
B         13        0       11        0
C         6         1       NA        1
D         NA        1       5         1
How would I code this?
This approach (as is any, according to your description) is absolutely dependent on the companies being in consecutive rows.
mtx <- apply(is.na(dat[,-1]), MARGIN = 2,
             function(vec) zoo::rollapply(vec, 2, function(z) xor(z[1], z[2]), fill = FALSE))
mtx
#      var_1_12 var_1_13 var_1_14
# [1,]    FALSE    FALSE    FALSE
# [2,]    FALSE    FALSE     TRUE
# [3,]     TRUE     TRUE     TRUE
# [4,]    FALSE    FALSE    FALSE
out <- rowSums(mtx) == ncol(mtx)
out
# [1] FALSE FALSE  TRUE FALSE
out | c(FALSE, out[-length(out)])
# [1] FALSE FALSE  TRUE  TRUE
### and your 0/1 numbers, if logical isn't right for you
+(out | c(FALSE, out[-length(out)]))
# [1] 0 0 1 1
Brief walk-through:
is.na(dat[,-1]) returns a matrix of whether the values (except the first column) are NA; because it's a matrix, we use apply to call a function on each column (using MARGIN=2);
zoo::rollapply is a function that does rolling calculations on a portion ("window") of the vector at a time, in this case 2-wide. For example, if we have 1:5, then it first looks at c(1,2), then c(2,3), then c(3,4), etc.;
xor is an eXclusive OR, meaning it is true when exactly one of its arguments is true and the other is false;
mtx is a matrix indicating that a cell and the one below it met the conditions (one is NA, the other is not). We then check to see which of these rows are all true, forming out;
since we need a 1 in both the year before and the year after the merge, we vector-OR (|) out with a shifted copy of itself to produce your intended output.
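The same steps can be sketched without the zoo dependency by pairing each row of the NA matrix with the row below it directly (an illustrative rewrite, not the answerer's code; it makes the same consecutive-rows assumption):

```r
# dat as in the question: first column is ORG_NAME, the rest are yearly values
dat <- data.frame(ORG_NAME = c("A", "B", "C", "D"),
                  var_1_12 = c(12, 13, 6, NA),
                  var_1_13 = c(11, 13, 7, NA),
                  var_1_14 = c(5, 11, NA, 5))

na <- is.na(dat[, -1])
# xor of each row with the row below: TRUE where exactly one of the pair is NA
pair <- xor(na[-nrow(na), , drop = FALSE], na[-1, , drop = FALSE])
merged <- c(rowSums(pair) == ncol(pair), FALSE)  # TRUE when every column flips
dat$dummy <- as.numeric(merged | c(FALSE, merged[-length(merged)]))
dat$dummy
# [1] 0 0 1 1
```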
If I understand correctly, you want to code with "1" the rows with at least one NA. If so, you just need one dummy variable for all the years, right? Something like this:
set.seed(4)
df <- data.frame(org=as.factor(LETTERS[1:5]),y1=sample(c(1:4,NA),5),y2=sample(c(3:6,NA),5),y3=sample(c(2:5,NA),5))
df$dummy <- as.numeric(apply(df, 1, function(x)any(is.na(x))))
which give you
  org y1 y2 y3 dummy
1   A  3  5  3     0
2   B NA  4  5     1
3   C  4  3  2     0
4   D  1  6 NA     1
5   E  2 NA  4     1

R tapply: different R releases produce different outputs

The Problem
This a simple tapply example:
z=data.frame(s=as.character(NA), rows=c(1,2,1), cols=c(1,1,2), stringsAsFactors=FALSE)
tapply(z$s, list(z$rows, z$cols), identity)
On R (Another Canoe) v3.3.3 (2017-03-06) for Windows, it gives:
# 1 2
# 1 NA NA
# 2 NA NA
On R (You Stupid Darkness) v3.4.0 (2017-04-21) for Windows, it gives:
# 1 2
# 1 NA NA
# 2 NA ""
R News References
According to NEWS.R-3.4.0:
tapply() gets new option default = NA allowing to change the previously hardcoded value.
Here, instead, it seems to default to an empty string.
Inconsistencies Among Data Types
The new behavior is inconsistent with the numeric or logical version, where one still gets all NAs:
z=data.frame(s=as.numeric(NA), rows=c(1,2,1), cols=c(1,1,2), stringsAsFactors=FALSE)
tapply(z$s, list(z$rows, z$cols), identity)
# 1 2
# 1 NA NA
# 2 NA NA
The same is for s=NA, which means s=as.logical(NA).
An Even Worse Case
In a more realistic context the character vector s in z has several values including NAs.
z=data.frame(s=c('a', NA, 'c'), rows=c(1,2,1), cols=c(1,1,2), stringsAsFactors=FALSE)
m=tapply(z$s, list(z$rows, z$cols), identity)
z;m
# s rows cols
# 1 a 1 1
# 2 <NA> 2 1
# 3 c 1 2
# 1 2
# 1 "a" "c"
# 2 NA ""
In general, we might fix this by setting missing values for the combinations with no values:
m[!nzchar(m)]=NA; m
# 1 2
# 1 "a" "c"
# 2 NA NA
Now when there is no value, such as in (2,2), one correctly gets an NA, as in the old versions.
But what if the input of tapply already has some empty strings?
z=data.frame(s=c('a', NA, ''), rows=c(1,2,1), cols=c(1,1,2), stringsAsFactors=FALSE)
m=tapply(z$s, list(z$rows, z$cols), identity)
z;m
# s rows cols
# 1 a 1 1
# 2 <NA> 2 1
# 3 1 2
# 1 2
# 1 "a" ""
# 2 NA ""
Now there is no way to distinguish the legal empty string in (1,2) from the one artificially added in (2,2) in place of the NA by the new tapply, so we can't apply the fix.
Questions
Is the new behavior really the correct one?
That is, if there is no string for rows=2 and cols=2, why is this not reported as a missing value (NA), and why does this happen only for character data types?
Can we rewrite the code above in such a way to get a consistent behavior across R versions?
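One version-independent sketch (my workaround, not from the thread): avoid tapply's fill value altogether by pre-allocating an all-NA matrix and assigning only the observed cells, so empty combinations stay NA while legal empty strings survive:

```r
z <- data.frame(s = c('a', NA, ''), rows = c(1, 2, 1), cols = c(1, 1, 2),
                stringsAsFactors = FALSE)

r <- sort(unique(z$rows)); co <- sort(unique(z$cols))
m <- matrix(NA_character_, nrow = length(r), ncol = length(co),
            dimnames = list(r, co))
# index by (row label, col label) pairs; untouched cells keep their NA
m[cbind(as.character(z$rows), as.character(z$cols))] <- z$s
m
#   1   2
# 1 "a" ""
# 2 NA  NA
```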

Check if a row of a dataframe is equal to a row of another, disregarding NA values

I have two data.frames.
df1 may have several rows which never contain NAs and it looks like:
col1 col2 col3 col4
   1    1    2    1
   2    1    2    3
   3    1    2    1
   1    2    2    2
while df2 has always one row that may contain NAs and it looks like this:
col1 col2 col3 col4
   1   NA    2   NA
I am looking for a way to test, for each line of df1, whether df2 is equal to that line, disregarding all values that are NA. For the example above I expect to get:
TRUE FALSE FALSE TRUE
So far I have tried several combinations of all() and which() but I haven't found an efficient solution.
Thanks.
You could do:
m1 <- t(df1)
v1 <- unlist(df2)
m1[is.na(v1)] <- NA
colSums(m1==v1|is.na(m1))==ncol(m1)
#[1] TRUE FALSE FALSE TRUE
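An equivalent but perhaps more readable sketch (my rewrite, not the answerer's code): compare each row against the template and let na.rm drop the NA comparisons. This relies on the stated guarantee that df1 itself never contains NAs:

```r
df1 <- data.frame(col1 = c(1, 2, 3, 1), col2 = c(1, 1, 1, 2),
                  col3 = c(2, 2, 2, 2), col4 = c(1, 3, 1, 2))
df2 <- data.frame(col1 = 1, col2 = NA, col3 = 2, col4 = NA)

v <- unlist(df2)
# r == v yields NA wherever df2 has an NA; na.rm = TRUE ignores those cells
# (note: a row would also match trivially if df2 were all NA)
res <- unname(apply(df1, 1, function(r) all(r == v, na.rm = TRUE)))
res
# [1]  TRUE FALSE FALSE  TRUE
```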

Tricky multi-step subset selection

I have a matrix:
1 3 NA
1 2 0
1 7 2
1 5 NA
1 9 5
1 6 3
2 5 2
2 6 1
3 NA 4
4 2 9
...
I would like to select, for each number in the first column, the rows whose own second column is NA and whose first-column value appears among that number's second-column values.
So the search would go the following way:
look up number in the first column: 1.
check corresponding values in second column: 3,2,7,5,9,6...
look up 3,2,7,5,9,6 in first column and see if they have NA in their
second column
The result in the above case would be:
>3 NA 4<
Since this is the only value which has NA in its own second column.
Here's what I want to do in words:
Look at the number in column one, I find '1'.
What numbers does 1 have in its second column: 3,2,7,5,9,6
Do these numbers have NA in their own second column? yes, 3 has an NA
I would like it to return those numbers not row numbers.
the result would be the subset of the original matrix with those rows which satisfy the condition.
This would be the matlab equivalent, where i is the number in column 1:
isnan(matrix(matrix(:,1)==i,2))==1)
Using by, to get the result by group of column 1, assuming dat is your data frame
by(dat, dat$V1, FUN = function(x) {
  y <- dat[which(dat$V1 %in% x$V2), ]
  y[is.na(y$V2), ]
})
dat$V1: 1
V1 V2 V3
9 3 NA 4
--------------------------------------------------------------------------------
dat$V1: 2
[1] V1 V2 V3
<0 rows> (or 0-length row.names)
--------------------------------------------------------------------------------
dat$V1: 3
[1] V1 V2 V3
<0 rows> (or 0-length row.names)
--------------------------------------------------------------------------------
dat$V1: 4
[1] V1 V2 V3
<0 rows> (or 0-length row.names)
EDIT
Here I try to reproduce the matlab command in R:
isnan(matrix(matrix(:,1)==i,2))==1) ## the matlab command; what is i here?
is.na(dat[dat[dat[,1]==1,2],])      ## the R equivalent, with i = 1
     V1    V2    V3
3 FALSE FALSE FALSE
2 FALSE FALSE FALSE
7 FALSE FALSE FALSE
5 FALSE FALSE FALSE
9 FALSE  TRUE FALSE
6 FALSE FALSE FALSE
This hopefully reads easily as it follows the steps you described:
idx1 <- m[, 1L] == 1L
idx2 <- m[, 1L] %in% m[idx1, 2L]
idx3 <- idx2 & is.na(m[, 2L])
m[idx3, ]
# V1 V2 V3
# 3 NA 4
It is all vectorized and uses integer comparison so it should not be terribly slow. However, if it is too slow for your needs, you should use a data.table and use your first column as the key.
Note that you don't need any of the assignments, so if you are looking for a one-liner:
m[is.na(m[, 2L]) & m[, 1L] %in% m[m[, 1L] == 1L, 2L], ]
# [1] 3 NA 4
(but definitely harder to read and maintain.)
I am still not totally clear as to what you want, but maybe this would work?
m<-read.table(
textConnection("1 3 NA
1 2 0
1 7 2
1 5 NA
1 9 5
1 6 3
2 5 2
2 6 1
3 NA 4
4 2 9"))
do.call(rbind, lapply(split(m[,2], m[,1]), function(x)
  m[x[!is.na(x)][is.na(m[x[!is.na(x)], 2])], ]))
# V1 V2 V3
# 1 3 NA 4
It would be much nicer if you provided an example in which you want more than one row returned.
